Distributional Energy-Based Models for Uncertainty-Aware Structured LLM Reasoning

Abhey Kalia; Fernanda Gonçalves; Shebin Rawther; Shireen Kudukkil Manchingal

arXiv: 2605.18871 · v1 · pith:TFU3GEGYnew · submitted 2026-05-15 · cs.LG · cs.AI


Pith reviewed 2026-05-20 20:42 UTC · model grok-4.3


keywords: energy-based models · LLM verification · uncertainty estimation · structured reasoning · ensemble adapters · constraint satisfaction · two-pass inference · low-rank adaptation


The pith

A 149M-parameter verifier with distributional energy-based scoring guides larger LLMs to fewer constraint violations in structured outputs.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces a verifier whose energy function decomposes into a learned quality scorer and fixed analytical penalties for constraints such as budgets or logical consistency. The scorer is an ensemble of low-rank adapters on a frozen encoder: the ensemble mean ranks candidate outputs, and the ensemble standard deviation estimates uncertainty and triggers a second pass of regeneration or abstention. With this setup the 149M-parameter verifier orchestrates generators of 7B to 26B parameters, outperforms single-shot Qwen-72B on all five benchmarks, and cuts constraint violations on TravelPlanner by 53% relative to Opus 4.6. This matters because it shows verification can be handled separately from generation, potentially making complex reasoning more reliable without requiring ever-larger models for every step. The paper also reports that scorers trained on hard problems transfer zero-shot to new domains.

Core claim

A decomposed energy function that pairs an ensemble of low-rank adapters (3 percent trainable parameters on a frozen encoder) with deterministic constraint penalties can rank and refine structured LLM outputs. The ensemble mean selects the best candidate while the standard deviation drives a two-pass loop for targeted regeneration or abstention. This 149M verifier, when paired with 7-26B generators, outperforms single-shot Qwen-72B on all five benchmarks, matches Claude Sonnet on MuSR, and reduces constraint violations by 53 percent relative to Opus on TravelPlanner.
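A minimal sketch of how such a decomposed energy could drive selection, assuming lower energy means better quality, an additive penalty term, and a fixed σ threshold. The function name, `penalty_weight`, and `sigma_threshold` are illustrative assumptions; the paper's exact weighting is not given on this page.

```python
import numpy as np

def select_candidate(ensemble_energies, penalties,
                     penalty_weight=1.0, sigma_threshold=0.5):
    """Pick the lowest-energy candidate and flag it if the ensemble disagrees.

    ensemble_energies: shape (K, N), one row per adapter, one column per candidate.
    penalties: shape (N,), non-negative analytical constraint penalties.
    penalty_weight, sigma_threshold: hypothetical knobs, not values from the paper.
    """
    energies = np.asarray(ensemble_energies, dtype=float)
    penalties = np.asarray(penalties, dtype=float)

    mu = energies.mean(axis=0)            # ensemble mean: used for ranking
    sigma = energies.std(axis=0, ddof=1)  # ensemble spread: used as uncertainty

    total = mu + penalty_weight * penalties  # assumed additive decomposition
    best = int(np.argmin(total))             # assumed: lower energy is better

    # Second pass (regenerate or abstain) only when the adapters disagree.
    needs_second_pass = bool(sigma[best] > sigma_threshold)
    return best, float(total[best]), float(sigma[best]), needs_second_pass


# Three adapters (K=3) scoring three candidates (N=3).
energies = [[-2.0, -1.0, 0.5],
            [-2.2, -0.8, 0.4],
            [-1.8, -1.2, 0.6]]
penalties = [3.0, 0.0, 0.0]  # candidate 0 violates a constraint
best, energy, sigma, second_pass = select_candidate(energies, penalties)
```

In this toy input, candidate 0 has the best learned score but carries a constraint penalty, so the penalty-free candidate 1 is selected; its adapters agree closely, so no second pass is requested.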

What carries the argument

The decomposed energy function that combines ensemble-based quality scoring with analytical constraint penalties, using mean for ranking and standard deviation for uncertainty-guided two-pass inference.

Load-bearing premise

The ensemble standard deviation must reliably indicate epistemic uncertainty in a way that makes the regeneration or abstention decisions improve final correctness rather than add compute without gain.
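One way to probe this premise is to measure how well σ separates incorrect selections from correct ones, which is what a σ-AUROC summarises. The sketch below is a hedged illustration using the pairwise (Mann-Whitney) definition of AUROC; `sigma_auroc` and its argument names are hypothetical and not taken from the paper.

```python
import numpy as np

def sigma_auroc(sigma, is_incorrect):
    """AUROC of sigma as a detector of incorrect answers.

    Returns the probability that a randomly chosen incorrect item has a
    higher sigma than a randomly chosen correct item (ties count as 0.5).
    0.5 is chance level; 1.0 means sigma ranks every error above every
    correct answer.
    """
    sigma = np.asarray(sigma, dtype=float)
    labels = np.asarray(is_incorrect, dtype=bool)
    pos, neg = sigma[labels], sigma[~labels]
    if pos.size == 0 or neg.size == 0:
        raise ValueError("need at least one correct and one incorrect item")

    # Compare every incorrect item against every correct item.
    diff = pos[:, None] - neg[None, :]
    wins = (diff > 0).sum() + 0.5 * (diff == 0).sum()
    return float(wins / diff.size)


informative = sigma_auroc([0.9, 0.8, 0.2, 0.1], [True, True, False, False])
uninformative = sigma_auroc([0.9, 0.1, 0.8, 0.2], [True, True, False, False])
```

A value near 0.5 would mean the second pass is triggered essentially at random, which is the failure case this premise warns about.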

What would settle it

Running the two-pass loop on the same benchmarks and finding no reduction in constraint violations or no gain in accuracy compared with single-pass generation from the same generators.

Figures

Figures reproduced from arXiv: 2605.18871 by Abhey Kalia, Fernanda Gonçalves, Shebin Rawther, Shireen Kudukkil Manchingal.

**Figure 1.** Left: Chain-of-thought scores each step locally; cross-step violations are invisible. Right: Distributional energy scores N candidates globally, verifies constraints analytically, and quantifies confidence via ensemble agreement. The bottleneck, however, is not generation but selection: pass@N substantially exceeds pass@1, the correct candidate is usually somewhere in the pool of N samples from the same …

**Figure 2.** Method overview. Top: Training. Correct (y+) / incorrect (y−) candidate pairs pass through the frozen ModernBERT backbone with one of K=5 heterogeneous LoRA adapters; each adapter's MLP head maps [CLS] to a scalar E_k trained via Bradley-Terry. The K energies form a distribution: µ_quality ranks, σ_quality quantifies confidence. Bottom: Inference. Four LLMs generate N=32 candidates, scored in parallel by t…
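The Bradley-Terry objective named in the Figure 2 caption can be sketched as a pairwise logistic loss on energy differences. This is a minimal NumPy illustration, assuming that lower energy means higher quality so the correct candidate should receive the lower energy; the function name and sign convention are assumptions, not the authors' implementation.

```python
import numpy as np

def bradley_terry_loss(energy_pos, energy_neg):
    """Mean pairwise Bradley-Terry loss for (correct, incorrect) pairs.

    Under the assumed convention, P(pos preferred) = sigmoid(E_neg - E_pos),
    so the loss is -log sigmoid(E_neg - E_pos) = log(1 + exp(-(E_neg - E_pos))).
    """
    margin = np.asarray(energy_neg, dtype=float) - np.asarray(energy_pos, dtype=float)
    # logaddexp(0, -m) computes log(1 + exp(-m)) without overflow.
    return float(np.mean(np.logaddexp(0.0, -margin)))


tied = bradley_terry_loss([0.0], [0.0])            # no preference: log(2)
well_ranked = bradley_terry_loss([-2.0], [1.0])    # correct has lower energy
mis_ranked = bradley_terry_loss([1.0], [-2.0])     # correct has higher energy
```

Each adapter in the ensemble would minimise a loss of this form on its own scalar output; the loss shrinks as the correct candidate's energy drops below the incorrect one's.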

**Figure 3.** Pass@N [Chen et al., 2021] curves. The Distributional EBM front-loads correct candidates: pass@2 reaches 97.5% on GSM8K and 71.5% on MuSR. TravelPlanner uses a relaxed violation threshold (θ=0.15).
[Plot: panel "Distributional EBM selection"; axes "Learned energy quality", "Constraint violation score", "Violation score".]
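A ranked pass@k of the kind plotted in Figure 3 can be sketched as follows: sort each problem's candidates by energy and check whether any of the top k is correct. This is an assumption about how the curves are computed (it is not the unbiased sampling estimator of Chen et al.), and `ranked_pass_at_k` is a hypothetical helper that again treats lower energy as better.

```python
import numpy as np

def ranked_pass_at_k(energies, correct, k):
    """Fraction of problems with a correct candidate among the k lowest energies.

    energies: list of per-problem sequences of candidate energies.
    correct:  list of per-problem sequences of booleans, aligned with energies.
    """
    hits = []
    for problem_energies, problem_correct in zip(energies, correct):
        order = np.argsort(np.asarray(problem_energies, dtype=float), kind="stable")
        top_k = order[:k]  # indices of the k best-ranked candidates
        hits.append(bool(np.asarray(problem_correct, dtype=bool)[top_k].any()))
    return float(np.mean(hits))


energies = [[0.3, -1.0, 0.5],   # problem 1: candidate 1 ranks first
            [-0.2, 0.1, -0.9]]  # problem 2: candidate 2 ranks first
correct = [[False, True, False],
           [True, False, False]]
p1 = ranked_pass_at_k(energies, correct, k=1)
p2 = ranked_pass_at_k(energies, correct, k=2)
```

"Front-loading" in the caption corresponds to this curve rising quickly at small k: a good scorer places the correct candidate in the first one or two ranks.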

**Figure 5.** Figure 5 visualises the (µ_quality, σ_quality) space; full results are in §F.5 and §F.4.
[Plot: panels GSM8K, MuSR, TravelPlanner, TACO, K&K; legend: Correct, Incorrect; annotation "= 0.8" with its symbol lost in extraction.]

**Figure 6.** Effect of ensemble size K. (a) Pass@1 is stable across K on all datasets; the ensemble does not improve top-1 accuracy. (b) σ-AUROC improves from chance (0.5 at K=1) to 0.87 on K&K at K=5, confirming that structurally heterogeneous adapters produce informative uncertainty. The ensemble is justified by uncertainty estimation, not accuracy. Single-model vs. multi-model (Tab. 12). Multi-model candidate divers…

**Figure 7.** Candidate pool scaling. The Distributional EBM's advantage over random grows with …

**Figure 8.** Energy distributions for correct (filled) and incorrect (dashed outline) candidates on GSM8K and …

**Figure 9.** Calibration reliability diagrams. Bars show accuracy within each …

**Figure 10.** Zero-shot transfer heatmap (ModernBERT): each cell shows pass@1 and (lift over random). Colour …

Original abstract

When Large Language Models produce structured outputs such as travel plans, code solutions, or multi-step proofs, individual reasoning steps may appear correct while the output as a whole violates budgets, fails test cases, or contradicts earlier deductions. We propose a decomposed energy function that combines a learned quality scorer with deterministic analytical constraint penalties for verifying structured LLM outputs. The quality scorer is a heterogeneous ensemble of low-rank adapters on a single frozen encoder (3% trainable parameters); the ensemble mean ranks candidates while the standard deviation quantifies epistemic uncertainty, driving a two-pass inference loop that triggers targeted regeneration or abstention. Across five benchmarks (GSM8K, MuSR, TravelPlanner, TACO, Knights & Knaves), our 149M-parameter verifier orchestrating a pool of 7-26B open generators outperforms single-shot Qwen-72B on every benchmark, matches Claude Sonnet 4.6 on MuSR (67.7% vs. 68.0%), and reduces constraint violations by 53% relative to Opus 4.6 on TravelPlanner (oracle 0.028, random 0.231). The two routes are complementary: structural verification wins when constraints are checkable (the verifier captures signal frontier models cannot self-detect), while pretraining-scale priors win where they are not (narrative inference, code semantics). A cross-dataset confounding analysis confirms genuine quality discrimination on four reasoning tasks and identifies a model-identity shortcut on code, mitigated via last-layer retraining. Scorers trained on difficult data transfer zero-shot: a MuSR-trained scorer achieves 93.9% on GSM8K without seeing a math problem.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note (private letter to a colleague)

The paper offers a small verifier that combines a LoRA ensemble scorer with hard constraint penalties to filter structured LLM outputs, and it reports benchmark gains. However, the contribution of the uncertainty-driven regeneration step is not isolated from the effect of the extra compute it consumes.


The main takeaway is a 149M-parameter verifier that scores candidate structured outputs from larger generators using a decomposed energy function: a learned quality term from a heterogeneous low-rank adapter ensemble plus deterministic penalties for violations like budget overruns or failed tests. The ensemble standard deviation then feeds a two-pass loop that regenerates or abstains on uncertain cases. On the reported benchmarks it beats single-shot Qwen-72B across the board, matches Claude on MuSR, and cuts constraint violations substantially on TravelPlanner relative to Opus.

Referee Report

2 major / 2 minor

Summary. The manuscript introduces distributional energy-based models for verifying structured LLM outputs such as plans, code, and proofs. It defines a decomposed energy function that combines a learned quality scorer—an ensemble of low-rank adapters on a single frozen encoder (149M parameters, 3% trainable)—with deterministic analytical constraint penalties. The ensemble mean ranks candidates while the standard deviation quantifies epistemic uncertainty to drive a two-pass inference loop for targeted regeneration or abstention. Evaluations on GSM8K, MuSR, TravelPlanner, TACO, and Knights & Knaves claim that the verifier outperforms single-shot Qwen-72B on all tasks, matches Claude Sonnet 4.6 on MuSR (67.7% vs 68.0%), and reduces constraint violations by 53% relative to Opus 4.6 on TravelPlanner. Supporting analyses include cross-dataset confounding checks and zero-shot transfer from a MuSR-trained scorer to GSM8K.

Significance. If the results hold after isolating the uncertainty mechanism, the work offers an efficient, lightweight approach to improving structured reasoning reliability by letting small verifiers capture checkable constraints that larger generators miss. The zero-shot transfer results and confounding analysis provide useful evidence of robustness and genuine discrimination rather than dataset artifacts. These elements, combined with the parameter efficiency, could influence practical systems for planning and verification tasks.

major comments (2)

§4 (Experiments/Results): The headline gains on MuSR and TravelPlanner are reported only for the full two-pass system. No ablation is described that holds the energy function, candidate pool size, and total inference budget fixed while removing the ensemble standard deviation trigger (e.g., regenerating top-k unconditionally or based solely on mean score). This leaves open whether observed improvements stem from epistemic uncertainty quantification or simply from extra generations, which is load-bearing for the paper's central claim about uncertainty-aware reasoning.
§3 (Method): The exact formulation of the decomposed energy function (how the ensemble mean and standard deviation are computed from the heterogeneous LoRAs, and how they are weighted against the deterministic penalties) is not provided in sufficient detail. This affects assessment of reproducibility and whether the mean-std separation is doing independent work beyond the constraint penalties.
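The ablation requested in the first major comment can be sketched as a choice among trigger policies that all spend the same regeneration budget. This is a hypothetical illustration only: `pick_for_regeneration`, the policy names, and the convention that higher mean energy is worse are assumptions, not the authors' protocol.

```python
import numpy as np

def pick_for_regeneration(mu_best, sigma_best, budget, policy, seed=0):
    """Choose which problems get a second pass under a fixed budget.

    mu_best, sigma_best: per-problem mean energy and ensemble std of the
    currently selected candidate. Every policy returns exactly `budget`
    problem indices (capped at the number of problems), so compute is
    matched across policies.
    """
    mu_best = np.asarray(mu_best, dtype=float)
    sigma_best = np.asarray(sigma_best, dtype=float)
    n = mu_best.shape[0]
    budget = min(budget, n)

    if policy == "sigma":      # uncertainty-triggered: most disagreement first
        order = np.argsort(-sigma_best, kind="stable")
    elif policy == "mean":     # score-triggered: worst (highest) energy first
        order = np.argsort(-mu_best, kind="stable")
    elif policy == "random":   # compute-matched control with no signal
        order = np.random.default_rng(seed).permutation(n)
    else:
        raise ValueError(f"unknown policy: {policy}")
    return sorted(int(i) for i in order[:budget])


mu = [-2.0, -1.0, 0.5, -3.0]
sigma = [0.1, 0.9, 0.2, 0.7]
by_sigma = pick_for_regeneration(mu, sigma, budget=2, policy="sigma")
by_mean = pick_for_regeneration(mu, sigma, budget=2, policy="mean")
by_random = pick_for_regeneration(mu, sigma, budget=2, policy="random")
```

If final accuracy after regeneration is similar for all three policies, the gains would be attributable to extra generations rather than to the uncertainty signal.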

minor comments (2)

Abstract: The constraint violation rates (oracle 0.028, random 0.231) are stated without immediate reference to the corresponding table or definition, reducing clarity for readers.
Tables in §4: Inclusion of error bars or multiple-run statistics on the benchmark accuracies would strengthen the presentation of the performance comparisons.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for their constructive and detailed feedback. The comments highlight important aspects of experimental rigor and methodological clarity that will strengthen the manuscript. We address each major comment below and describe the revisions we plan to incorporate.


Referee: The headline gains on MuSR and TravelPlanner are reported only for the full two-pass system. No ablation is described that holds the energy function, candidate pool size, and total inference budget fixed while removing the ensemble standard deviation trigger (e.g., regenerating top-k unconditionally or based solely on mean score). This leaves open whether observed improvements stem from epistemic uncertainty quantification or simply from extra generations, which is load-bearing for the paper's central claim about uncertainty-aware reasoning.

Authors: We agree that an explicit ablation isolating the contribution of the ensemble standard deviation is necessary to substantiate the central claim regarding uncertainty-aware reasoning. While the current experiments demonstrate the performance of the full two-pass system, we will add a controlled ablation in the revised manuscript. This ablation will hold the energy function, candidate pool size, and total inference budget fixed, comparing regeneration triggered by the standard deviation against variants that regenerate based solely on the mean score or unconditionally. The results will clarify whether the observed gains derive from epistemic uncertainty quantification beyond additional generations.
Revision: yes

Referee: The exact formulation of the decomposed energy function (how the ensemble mean and standard deviation are computed from the heterogeneous LoRAs, and how they are weighted against the deterministic penalties) is not provided in sufficient detail. This affects assessment of reproducibility and whether the mean-std separation is doing independent work beyond the constraint penalties.

Authors: We acknowledge that the precise mathematical formulation of the decomposed energy function requires additional detail for reproducibility. In the revised manuscript, we will expand Section 3 to include the exact equations: specifically, how the ensemble mean and standard deviation are computed across the heterogeneous low-rank adapters on the frozen encoder, and the weighting scheme that combines these statistics with the deterministic analytical constraint penalties in the overall energy function. This will allow readers to evaluate the independent contributions of the mean-std separation.
Revision: yes

Circularity Check

0 steps flagged

No significant circularity; empirical claims grounded in cross-dataset transfer and external benchmarks


The paper's core contribution is an empirical verifier system (ensemble of low-rank adapters plus deterministic penalties) evaluated on five benchmarks with explicit zero-shot transfer (MuSR-trained scorer on GSM8K) and cross-dataset confounding analysis. These elements supply independent grounding outside any single fitted set. No equations reduce a claimed prediction to its own inputs by construction, no load-bearing self-citation chains appear, and the two-pass loop is presented as an engineering choice rather than a derived necessity. The results are therefore self-contained against external benchmarks.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Abstract-only review; no explicit free parameters, axioms, or invented entities are stated beyond standard ML training assumptions and the existence of checkable constraints.



Reference graph

Works this paper leans on

72 extracted references · 72 canonical work pages · 5 internal anchors

[1] A Hitchhiker's Guide to the Relation of Energy-Based Models with Other Generative Models, Sampling and Statistical Physics. Transactions on Machine Learning Research.
[2] Autoregressive Language Models Are Secretly Energy-Based Models: Insights into the Lookahead Capabilities of Next-Token Prediction. 2025.
[3] A Theoretical Lens for RL-Tuned Language Models via Energy-Based Models. arXiv:2512.18730.
[4] Jiang, Eric Hanchen and others. Learning to […].
[5] Chen, Zhikang; Cui, Sen; Ye, Deheng; Zhang, Yu; Bian, Yatao; Zhu, Tingting. Think […].
[6] Chain-of-thought prompting elicits reasoning in large language models. Advances in Neural Information Processing Systems.
[7] Self-Consistency Improves Chain of Thought Reasoning in Language Models. The Eleventh International Conference on Learning Representations.
[8] Let's Verify Step by Step. 2023.
[9] Khalifa, Muhammad and others. Process […].
[10] On Memorization of Large Language Models in Logical Reasoning. IJCNLP-AACL.
[11] TravelPlanner: A Benchmark for Real-World Planning with Language Agents. International Conference on Machine Learning.
[12] Lin, Bill Yuchen and others. 2025.
[13] Training Verifiers to Solve Math Word Problems. 2021. https://arxiv.org/abs/2110.14168
[14] Lee, Kuang-Huei; Fischer, Ian; Wu, Yueh-Hua; Marwood, Dave; Baluja, Shumeet; Schuurmans, Dale; Chen, Xinyun. Evolving Deeper […].
[15] Snell, Charlie Victor; Lee, Jaehoon; Xu, Kelvin; Kumar, Aviral. Scaling […]. 2025.
[16] Large Language Monkeys: Scaling Inference Compute with Repeated Sampling. 2024.
[17] Why We Think. 2025.
[18] Xia, Yu; Fan, Jingru; Chen, Weize; Yan, Siyu; Cong, Xin; Zhang, Zhong; Lu, Yaxi; Lin, Yankai; Liu, Zhiyuan; Sun, Maosong. AgentRM: Enhancing Agent Generalization with Reward Modeling. Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers). 2025. doi:10.18653/v1/2025.acl-long.945
[19] Uncertainty Estimation for Language Reward Models. 2022.
[20] Deep Ensembles: A Loss Landscape Perspective. 2019.
[21] Lakshminarayanan, Balaji; Pritzel, Alexander; Blundell, Charles. Simple and […]. 2017.
[22] He, Pengcheng; Gao, Jianfeng; Chen, Weizhu. DeBERTaV3: Improving DeBERTa using ELECTRA-Style Pre-Training with Gradient-Disentangled Embedding Sharing. arXiv:2111.09543.
[23] Qwen2.5 Technical Report. 2024.
[24] Bradley, Ralph Allan; Terry, Milton E. Rank Analysis of Incomplete Block Designs: […]. 1952.
[25] Language models are few-shot learners. Advances in Neural Information Processing Systems.
[26] OpenAI. 2023.
[27] Least-to-Most Prompting Enables Complex Reasoning in Large Language Models. The Eleventh International Conference on Learning Representations.
[28] Tree of Thoughts: Deliberate Problem Solving with Large Language Models. Thirty-seventh Conference on Neural Information Processing Systems.
[29] Sprague, Zayne; Ye, Xi; Bostrom, Kaj; Chaudhuri, Swarat; Durrett, Greg. MuSR: Testing the Limits of Chain-of-thought with Multistep Soft Reasoning.
[30] Li, Rongao; Fu, Jie; Zhang, Bo-Wen; Huang, Tao; Sun, Zhihong; Lyu, Chen; Liu, Guang; Jin, Zhi; Li, Ge.
[31] Introducing gpt-oss. 2025.
[32] Smarter, better, faster, longer: A modern bidirectional encoder for fast, memory efficient, and long context finetuning and inference. Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers).
[33] The Llama 3 Herd of Models. arXiv:2407.21783.
[34] Gemma 4 Technical Report.
[35] The Claude Model Family: Claude Opus 4.6, Claude Sonnet 4.6. 2026.
[36] Shortcut Learning in Deep Neural Networks. Nature Machine Intelligence. 2020.
[37] Decoupled Weight Decay Regularization. International Conference on Learning Representations (ICLR).
[38] Training Language Models to Follow Instructions with Human Feedback. Advances in Neural Information Processing Systems (NeurIPS).
[39] Solving math word problems with process- and outcome-based feedback. arXiv.
[40] Guo, Daya; Yang, Dejian; Zhang, Haowei; Song, Junxiao; Wang, Peiyi; Zhu, Qihao; Xu, Runxin; Zhang, Ruoyu; Ma, Shirong; Bi, Xiao; and others.
[41] Houlsby, Neil; Giurgiu, Andrei; Jastrzebski, Stanislaw; Morrone, Bruna; de Laroussilhe, Quentin; Gesmundo, Andrea; Attariyan, Mona; Gelly, Sylvain. Parameter-Efficient Transfer Learning for […]. 2019.
[42] Stiennon, Nisan; Ouyang, Long; Wu, Jeff; Ziegler, Daniel M.; Lowe, Ryan; Voss, Chelsea; Radford, Alec; Amodei, Dario; Christiano, Paul. Proceedings of the 34th International Conference on Neural Information Processing Systems. 2020.
[43] Proving Test Set Contamination in Black Box Language Models. International Conference on Learning Representations (ICLR).
[44] Hu, Edward J.; Shen, Yelong; Wallis, Phillip; Allen-Zhu, Zeyuan; Li, Yuanzhi; Wang, Shean; Wang, Lu; Chen, Weizhu. 2022.
[45] Hendrycks, Dan; Burns, Collin; Kadavath, Saurav; Arora, Akul; Basart, Steven; Tang, Eric; Song, Dawn; Steinhardt, Jacob. Measuring Mathematical Problem Solving With the MATH Dataset. arXiv:2103.03874.
[46] Epistemic Deep Learning. arXiv:2206.07609.
[47] Manchingal, Shireen Kudukkil; Mubashar, Muhammad; Wang, Kaizheng; Shariatmadar, Keivan; Cuzzolin, Fabio. Random-Set Neural Networks […]. 2025.
[48] Kaushik, Divyansh; Hovy, Eduard; Lipton, Zachary C. Learning the […]. 2020.
[49] Kirichenko, Polina; Izmailov, Pavel; Wilson, Andrew Gordon. Last […]. 2023.
[50] Rethinking Benchmark and Contamination for Language Models with Rephrased Samples. arXiv.
[51] Kojima, Takeshi; Gu, Shixiang Shane; Reid, Machel; Matsuo, Yutaka; Iwasawa, Yusuke. Large […]. 2022.
[52] Christiano, Paul F; Leike, Jan; Brown, Tom; Martic, Miljan; Legg, Shane; Amodei, Dario. Deep Reinforcement Learning from Human Preferences.
[53] Direct Preference Optimization: Your Language Model is Secretly a Reward Model. Advances in Neural Information Processing Systems (NeurIPS).
[54] Gal, Yarin; Ghahramani, Zoubin. Dropout as a […]. 2016.
[55] Self-Refine: Iterative Refinement with Self-Feedback. Advances in Neural Information Processing Systems (NeurIPS).
[56] Distributionally Robust Neural Networks for Group Shifts: On the Importance of Regularization for Worst-Case Generalization. International Conference on Learning Representations (ICLR).
[57] Residual energy-based models for text generation. International Conference on Learning Representations (ICLR).
[58] Mixture-of-Agents Enhances Large Language Model Capabilities. The Thirteenth International Conference on Learning Representations.
[59] Attention Is All You Need. Advances in Neural Information Processing Systems (NeurIPS).
[60] Devlin, Jacob; Chang, Ming-Wei; Lee, Kenton; Toutanova, Kristina. 2019.
[61] Gaussian Error Linear Units (GELUs). arXiv: Learning.
[62] Layer Normalization. arXiv.
[63] Loshchilov, Ilya; Hutter, Frank. 2017.
[64] Holtzman, Ari; Buys, Jan; Du, Li; Forbes, Maxwell; Choi, Yejin. The […]. 2020.
[65] On calibration of modern neural networks. International Conference on Machine Learning. 2017.
[66] Shortcut learning in deep neural networks. Nature Machine Intelligence. 2020.
[67] Evaluating Large Language Models Trained on Code. arXiv:2107.03374.
[68] Bagging predictors. Machine Learning. 1996.
[69] LeCun, Yann; Chopra, Sumit; Hadsell, Raia; Ranzato, Marc'Aurelio; Huang, Fu Jie. A […]. 2006.
[70] Liu, Chris Yuhao; Zeng, Liang; Liu, Jiacai; Yan, Rui; He, Jujie; Wang, Chaojie; Yan, Shuicheng; Liu, Yang; Zhou, Yahui. Skywork-Reward: Bag of Tricks for Reward Modeling in […].
[71] Wang, Peiyi; Li, Lei; Shao, Zhihong; Xu, Runxin; Dai, Damai; Li, Yifei; Chen, Deli; Wu, Yu; Sui, Zhifang. Math-Shepherd: Verify and Reinforce […].
[72] Zheng, Lianmin; Chiang, Wei-Lin; Sheng, Ying; Zhuang, Siyuan; Wu, Zhanghao; Zhuang, Yonghao; Lin, Zi; Li, Zhuohan; Li, Dacheng; Xing, Eric P.; Zhang, Hao; Gonzalez, Joseph E.; Stoica, Ion. Judging […].


work page

[29] [29]

MuSR: Testing the Limits of Chain-of-thought with Multistep Soft Reasoning , volume =

Sprague, Zayne and Ye, Xi and Bostrom, Kaj and Chaudhuri, Swarat and Durrett, Greg , booktitle =. MuSR: Testing the Limits of Chain-of-thought with Multistep Soft Reasoning , volume =

work page

[30] [30]

Li, Rongao and Fu, Jie and Zhang, Bo-Wen and Huang, Tao and Sun, Zhihong and Lyu, Chen and Liu, Guang and Jin, Zhi and Li, Ge , year =

work page

[31] [31]

2025 , note =

Introducing gpt-oss , author =. 2025 , note =

work page 2025

[32] [32]

Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers) , pages=

Smarter, better, faster, longer: A modern bidirectional encoder for fast, memory efficient, and long context finetuning and inference , author=. Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers) , pages=

work page

[33] [33]

The Llama 3 Herd of Models

The llama 3 herd of models , author=. arxiv:2407.21783 , year=

work page internal anchor Pith review Pith/arXiv arXiv

[34] [34]

Gemma 4 Technical Report , author =

work page

[35] [35]

2026 , note =

The Claude Model Family: Claude Opus 4.6, Claude Sonnet 4.6 , author =. 2026 , note =

work page 2026

[36] [36]

Nature Machine Intelligence , volume =

Shortcut Learning in Deep Neural Networks , author =. Nature Machine Intelligence , volume =. 2020 , publisher=

work page 2020

[37] [37]

International Conference on Learning Representations (ICLR) , year =

Decoupled Weight Decay Regularization , author =. International Conference on Learning Representations (ICLR) , year =

work page

[38] [38]

Advances in Neural Information Processing Systems (NeurIPS) , year =

Training Language Models to Follow Instructions with Human Feedback , author =. Advances in Neural Information Processing Systems (NeurIPS) , year =

work page

[39] [39]

ArXiv , year=

Solving math word problems with process- and outcome-based feedback , author=. ArXiv , year=

work page

[40] [40]

Guo, Daya and Yang, Dejian and Zhang, Haowei and Song, Junxiao and Wang, Peiyi and Zhu, Qihao and Xu, Runxin and Zhang, Ruoyu and Ma, Shirong and Bi, Xiao and others , year =

work page

[41] [41]

Parameter-Efficient Transfer Learning for

Houlsby, Neil and Giurgiu, Andrei and Jastrzebski, Stanislaw and Morrone, Bruna and de Laroussilhe, Quentin and Gesmundo, Andrea and Attariyan, Mona and Gelly, Sylvain , booktitle =. Parameter-Efficient Transfer Learning for. 2019 , note =

work page 2019

[42] [42]

and Lowe, Ryan and Voss, Chelsea and Radford, Alec and Amodei, Dario and Christiano, Paul , title =

Stiennon, Nisan and Ouyang, Long and Wu, Jeff and Ziegler, Daniel M. and Lowe, Ryan and Voss, Chelsea and Radford, Alec and Amodei, Dario and Christiano, Paul , title =. Proceedings of the 34th International Conference on Neural Information Processing Systems , articleno =. 2020 , isbn =

work page 2020

[43] [43]

International Conference on Learning Representations (ICLR) , year =

Proving Test Set Contamination in Black Box Language Models , author =. International Conference on Learning Representations (ICLR) , year =

work page

[44] [44]

and Shen, Yelong and Wallis, Phillip and Allen-Zhu, Zeyuan and Li, Yuanzhi and Wang, Shean and Wang, Lu and Chen, Weizhu , booktitle =

Hu, Edward J. and Shen, Yelong and Wallis, Phillip and Allen-Zhu, Zeyuan and Li, Yuanzhi and Wang, Shean and Wang, Lu and Chen, Weizhu , booktitle =. 2022 , note =

work page 2022

[45] [45]

Measuring Mathematical Problem Solving With the MATH Dataset

Dan Hendrycks and Collin Burns and Saurav Kadavath and Akul Arora and Steven Basart and Eric Tang and Dawn Song and Jacob Steinhardt , year=. Measuring. 2103.03874 , archivePrefix=

work page internal anchor Pith review Pith/arXiv arXiv

[46] [46]

arxiv:2206.07609 , year =

Epistemic Deep Learning , author =. arxiv:2206.07609 , year =

work page arXiv

[47] [47]

Random-Set Neural Networks (

Manchingal, Shireen Kudukkil and Mubashar, Muhammad and Wang, Kaizheng and Shariatmadar, Keivan and Cuzzolin, Fabio , booktitle =. Random-Set Neural Networks (. 2025 , url =

work page 2025

[48] [48]

, booktitle =

Kaushik, Divyansh and Hovy, Eduard and Lipton, Zachary C. , booktitle =. Learning the. 2020 , note =

work page 2020

[49] [49]

Kirichenko, Polina and Izmailov, Pavel and Wilson, Andrew Gordon , booktitle =. Last. 2023 , note =

work page 2023

[50] [50]

ArXiv , year=

Rethinking Benchmark and Contamination for Language Models with Rephrased Samples , author=. ArXiv , year=

work page

[51] [51]

Kojima, Takeshi and Gu, Shixiang Shane and Reid, Machel and Matsuo, Yutaka and Iwasawa, Yusuke , booktitle =. Large. 2022 , note =

work page 2022

[52] [52]

Deep Reinforcement Learning from Human Preferences , url =

Christiano, Paul F and Leike, Jan and Brown, Tom and Martic, Miljan and Legg, Shane and Amodei, Dario , booktitle =. Deep Reinforcement Learning from Human Preferences , url =

work page

[53] [53]

Advances in Neural Information Processing Systems (NeurIPS) , year =

Direct Preference Optimization: Your Language Model is Secretly a Reward Model , author =. Advances in Neural Information Processing Systems (NeurIPS) , year =

work page

[54] [54]

Dropout as a

Gal, Yarin and Ghahramani, Zoubin , booktitle=. Dropout as a. 2016 , organization=

work page 2016

[55] [55]

Advances in Neural Information Processing Systems (NeurIPS) , year =

Self-Refine: Iterative Refinement with Self-Feedback , author =. Advances in Neural Information Processing Systems (NeurIPS) , year =

work page

[56] [56]

International Conference on Learning Representations (ICLR) , year =

Distributionally Robust Neural Networks for Group Shifts: On the Importance of Regularization for Worst-Case Generalization , author =. International Conference on Learning Representations (ICLR) , year =

work page

[57] [57]

International Conference on Learning Representations (ICLR) , year=

Residual energy-based models for text generation , author=. International Conference on Learning Representations (ICLR) , year=

work page

[58] [58]

The Thirteenth International Conference on Learning Representations , year=

Mixture-of-Agents Enhances Large Language Model Capabilities , author=. The Thirteenth International Conference on Learning Representations , year=

work page

[59] [59]

Advances in Neural Information Processing Systems (NeurIPS) , year =

Attention Is All You Need , author =. Advances in Neural Information Processing Systems (NeurIPS) , year =

work page

[60] [60]

2019 , note =

Devlin, Jacob and Chang, Ming-Wei and Lee, Kenton and Toutanova, Kristina , booktitle =. 2019 , note =

work page 2019

[61] [61]

arXiv: Learning , year=

Gaussian Error Linear Units (GELUs) , author=. arXiv: Learning , year=

work page

[62] [62]

ArXiv , year=

Layer Normalization , author=. ArXiv , year=

work page

[63] [63]

2017 , note =

Loshchilov, Ilya and Hutter, Frank , booktitle =. 2017 , note =

work page 2017

[64] [64]

Holtzman, Ari and Buys, Jan and Du, Li and Forbes, Maxwell and Choi, Yejin , booktitle =. The. 2020 , note =

work page 2020

[65] [65]

International conference on machine learning , pages=

On calibration of modern neural networks , author=. International conference on machine learning , pages=. 2017 , organization=

work page 2017

[66] [66]

Nature Machine Intelligence , volume=

Shortcut learning in deep neural networks , author=. Nature Machine Intelligence , volume=. 2020 , publisher=

work page 2020

[67] [67]

Evaluating Large Language Models Trained on Code

Evaluating Large Language Models Trained on Code , author=. arxiv:2107.03374 , year=

work page internal anchor Pith review Pith/arXiv arXiv

[68] [68]

Machine learning , volume=

Bagging predictors , author=. Machine learning , volume=. 1996 , publisher=

work page 1996

[69] [69]

LeCun, Yann and Chopra, Sumit and Hadsell, Raia and Ranzato, Marc'Aurelio and Huang, Fu Jie , journal =. A. 2006 , publisher =

work page 2006

[70] [70]

Skywork-Reward: Bag of Tricks for Reward Modeling in

Liu, Chris Yuhao and Zeng, Liang and Liu, Jiacai and Yan, Rui and He, Jujie and Wang, Chaojie and Yan, Shuicheng and Liu, Yang and Zhou, Yahui , journal=. Skywork-Reward: Bag of Tricks for Reward Modeling in

work page

[71] [71]

Math-Shepherd: Verify and Reinforce

Wang, Peiyi and Li, Lei and Shao, Zhihong and Xu, Runxin and Dai, Damai and Li, Yifei and Chen, Deli and Wu, Yu and Sui, Zhifang , booktitle=. Math-Shepherd: Verify and Reinforce

work page

[72] [72]

and Zhang, Hao and Gonzalez, Joseph E

Zheng, Lianmin and Chiang, Wei-Lin and Sheng, Ying and Zhuang, Siyuan and Wu, Zhanghao and Zhuang, Yonghao and Lin, Zi and Li, Zhuohan and Li, Dacheng and Xing, Eric P. and Zhang, Hao and Gonzalez, Joseph E. and Stoica, Ion , booktitle=. Judging

work page