
arxiv: 2605.18871 · v1 · pith:TFU3GEGYnew · submitted 2026-05-15 · 💻 cs.LG · cs.AI

Distributional Energy-Based Models for Uncertainty-Aware Structured LLM Reasoning

Pith reviewed 2026-05-20 20:42 UTC · model grok-4.3

classification 💻 cs.LG cs.AI
keywords energy-based models · LLM verification · uncertainty estimation · structured reasoning · ensemble adapters · constraint satisfaction · two-pass inference · low-rank adaptation

The pith

A 149M-parameter verifier with distributional energy-based scoring guides larger LLMs to fewer constraint violations in structured outputs.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces a verifier that decomposes an energy function into a learned quality scorer and fixed analytical penalties for constraints such as budgets or logical consistency. The scorer is an ensemble of low-rank adapters on a frozen encoder: the ensemble mean ranks candidate outputs, while the ensemble standard deviation estimates uncertainty and triggers a second pass of regeneration or abstention. With this setup, the 149M-parameter verifier orchestrates generators from 7B to 26B parameters, outperforms single-shot Qwen-72B on all five benchmarks, and cuts constraint violations by 53% relative to Opus 4.6 on TravelPlanner. A sympathetic reader would care because it shows verification can be handled separately from generation, potentially making complex reasoning more reliable without requiring ever-larger models for every step. The paper also reports zero-shot transfer: a MuSR-trained scorer reaches 93.9% on GSM8K without seeing a math problem.

Core claim

A decomposed energy function that pairs an ensemble of low-rank adapters (3% trainable parameters on a frozen encoder) with deterministic constraint penalties can rank and refine structured LLM outputs. The ensemble mean selects the best candidate, while the standard deviation drives a two-pass loop for targeted regeneration or abstention. This 149M-parameter verifier, paired with 7-26B generators, outperforms single-shot Qwen-72B on all five benchmarks, matches Claude Sonnet 4.6 on MuSR (67.7% vs. 68.0%), and reduces constraint violations by 53% relative to Opus 4.6 on TravelPlanner.
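A minimal sketch of how such a decomposed energy could rank candidates. It assumes that lower energy is better and that the learned and analytical terms combine additively; `score_candidates` and `penalty_weight` are hypothetical names, and the paper's actual weighting is not stated in this summary.

```python
from statistics import fmean, pstdev

def score_candidates(member_energies, penalties, penalty_weight=1.0):
    """Rank candidates with a decomposed energy (illustrative, not the paper's formula).

    member_energies[k][n]: learned energy from ensemble member k for candidate n
        (lower = better, by assumption).
    penalties[n]: deterministic constraint penalty for candidate n (>= 0).
    penalty_weight: hypothetical weight on the analytical penalties.
    """
    n_candidates = len(penalties)
    mu, sigma = [], []
    for n in range(n_candidates):
        column = [member[n] for member in member_energies]
        mu.append(fmean(column))      # ensemble mean: used for ranking
        sigma.append(pstdev(column))  # ensemble spread: uncertainty proxy
    # Assumed additive combination of learned quality and constraint penalties.
    total = [m + penalty_weight * p for m, p in zip(mu, penalties)]
    best = min(range(n_candidates), key=total.__getitem__)
    return best, total, sigma

# Toy data: K=3 ensemble members, N=3 candidates.
energies = [[-2.0, -1.0, -3.0],
            [-2.2, -0.8, -2.8],
            [-1.8, -1.2, -3.2]]
penalties = [0.0, 0.0, 5.0]  # candidate 2 violates a constraint
best, total, sigma = score_candidates(energies, penalties)
```

In this toy case candidate 2 has the lowest learned energy, but its constraint penalty pushes it behind candidate 0, which is the kind of cross-step violation the analytical term is meant to catch.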

What carries the argument

The decomposed energy function that combines ensemble-based quality scoring with analytical constraint penalties, using mean for ranking and standard deviation for uncertainty-guided two-pass inference.
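The two-pass loop can be sketched as follows. The trigger rule here is an assumption: accept when the best candidate's ensemble spread is at or below a threshold, otherwise regenerate once and abstain if the spread is still high. `generate`, `score`, and `sigma_threshold` are hypothetical stand-ins, not the paper's interface.

```python
def two_pass_select(prompt, generate, score, n=4, sigma_threshold=0.5):
    """Uncertainty-guided two-pass selection (illustrative sketch).

    generate(prompt, n) -> list of n candidate outputs.
    score(candidate)    -> (energy, sigma); lower energy = better (assumed).
    """
    def best_of(candidates):
        scored = [(score(c), c) for c in candidates]
        (energy, sigma), candidate = min(scored, key=lambda item: item[0][0])
        return candidate, energy, sigma

    # First pass: pick the lowest-energy candidate and check its uncertainty.
    pool = generate(prompt, n)
    candidate, _, sigma = best_of(pool)
    if sigma <= sigma_threshold:
        return candidate, "accepted_first_pass"

    # Second pass: regenerate, re-rank the enlarged pool, then accept or abstain.
    pool = pool + generate(prompt, n)
    candidate, _, sigma = best_of(pool)
    if sigma <= sigma_threshold:
        return candidate, "accepted_second_pass"
    return None, "abstained"

# Toy stubs: first-pass candidates are uncertain, second-pass ones are not.
calls = {"count": 0}

def toy_generate(prompt, n):
    calls["count"] += 1
    return [f"pass{calls['count']}-cand{i}" for i in range(n)]

def toy_score(candidate):
    return (-1.0, 0.9) if candidate.startswith("pass1") else (-2.0, 0.1)

answer, status = two_pass_select("plan a trip", toy_generate, toy_score, n=2)
```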

Load-bearing premise

The ensemble standard deviation must reliably indicate epistemic uncertainty in a way that makes the regeneration or abstention decisions improve final correctness rather than add compute without gain.
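One way to probe this premise is to treat the ensemble standard deviation as a detector of incorrect selections and compute its AUROC, a statistic the paper reports as σ-AUROC in Figure 6. The sketch below uses the pairwise (Mann-Whitney) definition of AUROC; the data are invented for illustration.

```python
def auroc(scores, labels):
    """Probability that a random positive outscores a random negative (ties count 0.5)."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative example")
    wins = 0.0
    for p in pos:
        for q in neg:
            if p > q:
                wins += 1.0
            elif p == q:
                wins += 0.5
    return wins / (len(pos) * len(neg))

# Invented toy data: one sigma per problem, label 1 if the selected answer was wrong.
sigmas = [0.1, 0.2, 0.9, 0.8, 0.3, 0.7]
is_incorrect = [0, 0, 1, 0, 0, 1]
sigma_auroc = auroc(sigmas, is_incorrect)  # 0.875 on this toy data
```

A value near 0.5 would mean the spread carries no information about correctness, in which case the regeneration trigger could only add compute.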

What would settle it

Running the two-pass loop on the same benchmarks and finding no reduction in constraint violations or no gain in accuracy compared with single-pass generation from the same generators.

Figures

Figures reproduced from arXiv: 2605.18871 by Abhey Kalia, Fernanda Gonçalves, Shebin Rawther, Shireen Kudukkil Manchingal.

Figure 1: Left: Chain-of-thought scores each step locally; cross-step violations are invisible. Right: Distributional energy scores N candidates globally, verifies constraints analytically, and quantifies confidence via ensemble agreement. The bottleneck, however, is not generation but selection: pass@N substantially exceeds pass@1, the correct candidate is usually somewhere in the pool of N samples from the same …
Figure 2: Method overview. Top: Training. Correct (y+) / incorrect (y−) candidate pairs pass through the frozen ModernBERT backbone with one of K=5 heterogeneous LoRA adapters; each adapter's MLP head maps [CLS] to a scalar E_k trained via Bradley-Terry. The K energies form a distribution: μ_quality ranks, σ_quality quantifies confidence. Bottom: Inference. Four LLMs generate N=32 candidates, scored in parallel by t…
Figure 3: Pass@N [Chen et al., 2021] curves. The Distributional EBM front-loads correct candidates: pass@2 reaches 97.5% on GSM8K and 71.5% on MuSR. TravelPlanner uses a relaxed violation threshold (θ=0.15).
[Recovered plot labels: Learned energy quality; Constraint violation score; Distributional EBM selection; Violation score.]
Figure 5: Visualises the (μ_quality, σ_quality) space; full results are in §F.5 and §F.4.
[Recovered plot labels: panels GSM8K, MuSR, TravelPlanner, TACO, K&K; axis label "quality"; legend entries Correct, Incorrect, and "= 0.8".]
Figure 6: Effect of ensemble size K. (a) Pass@1 is stable across K on all datasets; the ensemble does not improve top-1 accuracy. (b) σ-AUROC improves from chance (0.5 at K=1) to 0.87 on K&K at K=5, confirming that structurally heterogeneous adapters produce informative uncertainty. The ensemble is justified by uncertainty estimation, not accuracy. Single-model vs. multi-model (Tab. 12). Multi-model candidate divers…
Figure 7: Candidate pool scaling. The Distributional EBM's advantage over random grows with …
Figure 8: Energy distributions for correct (filled) and incorrect (dashed outline) candidates on GSM8K and …
Figure 9: Calibration reliability diagrams. Bars show accuracy within each …
Figure 10: Zero-shot transfer heatmap (ModernBERT): each cell shows pass@1 and (lift over random). Colour …
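The Figure 2 caption says each adapter head is trained via Bradley-Terry on correct/incorrect pairs. A minimal sketch of such a pairwise loss follows, assuming the energy convention that a correct candidate should receive the lower energy; the function name and sign convention are illustrative and not taken from the paper.

```python
import math

def bradley_terry_loss(energy_correct, energy_incorrect):
    """Pairwise loss -log sigmoid(E(y-) - E(y+)) for one candidate pair.

    Computed as softplus(E(y+) - E(y-)) in a numerically stable form.
    Assumes lower energy means higher quality.
    """
    margin = energy_correct - energy_incorrect
    # softplus(x) = log(1 + exp(x)) = max(x, 0) + log1p(exp(-|x|))
    return max(margin, 0.0) + math.log1p(math.exp(-abs(margin)))

# The loss is small when the correct candidate has clearly lower energy,
# and large when the ordering is reversed.
well_ordered = bradley_terry_loss(energy_correct=-2.0, energy_incorrect=1.0)
mis_ordered = bradley_terry_loss(energy_correct=1.0, energy_incorrect=-2.0)
```

In the setup described by the caption, each of the K adapters would be trained with its own copy of a loss like this, and the spread of their outputs at inference time would supply σ_quality.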
Original abstract

When Large Language Models produce structured outputs such as travel plans, code solutions, or multi-step proofs, individual reasoning steps may appear correct while the output as a whole violates budgets, fails test cases, or contradicts earlier deductions. We propose a decomposed energy function that combines a learned quality scorer with deterministic analytical constraint penalties for verifying structured LLM outputs. The quality scorer is a heterogeneous ensemble of low-rank adapters on a single frozen encoder (3% trainable parameters); the ensemble mean ranks candidates while the standard deviation quantifies epistemic uncertainty, driving a two-pass inference loop that triggers targeted regeneration or abstention. Across five benchmarks (GSM8K, MuSR, TravelPlanner, TACO, Knights & Knaves), our 149M-parameter verifier orchestrating a pool of 7-26B open generators outperforms single-shot Qwen-72B on every benchmark, matches Claude Sonnet 4.6 on MuSR (67.7% vs. 68.0%), and reduces constraint violations by 53% relative to Opus 4.6 on TravelPlanner (oracle 0.028, random 0.231). The two routes are complementary: structural verification wins when constraints are checkable (the verifier captures signal frontier models cannot self-detect), while pretraining-scale priors win where they are not (narrative inference, code semantics). A cross-dataset confounding analysis confirms genuine quality discrimination on four reasoning tasks and identifies a model-identity shortcut on code, mitigated via last-layer retraining. Scorers trained on difficult data transfer zero-shot: a MuSR-trained scorer achieves 93.9% on GSM8K without seeing a math problem.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The manuscript introduces distributional energy-based models for verifying structured LLM outputs such as plans, code, and proofs. It defines a decomposed energy function that combines a learned quality scorer—an ensemble of low-rank adapters on a single frozen encoder (149M parameters, 3% trainable)—with deterministic analytical constraint penalties. The ensemble mean ranks candidates while the standard deviation quantifies epistemic uncertainty to drive a two-pass inference loop for targeted regeneration or abstention. Evaluations on GSM8K, MuSR, TravelPlanner, TACO, and Knights & Knaves claim that the verifier outperforms single-shot Qwen-72B on all tasks, matches Claude Sonnet 4.6 on MuSR (67.7% vs 68.0%), and reduces constraint violations by 53% relative to Opus 4.6 on TravelPlanner. Supporting analyses include cross-dataset confounding checks and zero-shot transfer from a MuSR-trained scorer to GSM8K.

Significance. If the results hold after isolating the uncertainty mechanism, the work offers an efficient, lightweight approach to improving structured reasoning reliability by letting small verifiers capture checkable constraints that larger generators miss. The zero-shot transfer results and confounding analysis provide useful evidence of robustness and genuine discrimination rather than dataset artifacts. These elements, combined with the parameter efficiency, could influence practical systems for planning and verification tasks.

major comments (2)
  1. [§4 (Experiments/Results)] The headline gains on MuSR and TravelPlanner are reported only for the full two-pass system. No ablation is described that holds the energy function, candidate pool size, and total inference budget fixed while removing the ensemble standard deviation trigger (e.g., regenerating top-k unconditionally or based solely on mean score). This leaves open whether observed improvements stem from epistemic uncertainty quantification or simply from extra generations, which is load-bearing for the paper's central claim about uncertainty-aware reasoning.
  2. [§3 (Method)] The exact formulation of the decomposed energy function (how the ensemble mean and standard deviation are computed from the heterogeneous LoRAs and how they are weighted against the deterministic penalties) is not provided in sufficient detail. This affects assessment of reproducibility and whether the mean-std separation is doing independent work beyond the constraint penalties.
minor comments (2)
  1. [Abstract] The constraint violation rates (oracle 0.028, random 0.231) are stated without immediate reference to the corresponding table or definition, reducing clarity for readers.
  2. [Tables in §4] Inclusion of error bars or multiple-run statistics on the benchmark accuracies would strengthen the presentation of the performance comparisons.
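The ablation requested in the first major comment can be sketched as a budget-matched choice of which problems to regenerate. The three policies below (σ-triggered, mean-triggered, random) and every name in the code are hypothetical; the point is only that all policies spend the same regeneration budget, so any difference in final accuracy can be attributed to the trigger signal.

```python
import random

def choose_regeneration_set(mu, sigma, budget, policy, seed=0):
    """Pick `budget` problem indices to regenerate under a given trigger policy.

    mu[i], sigma[i]: ensemble mean energy and spread of the first-pass
    selection for problem i (lower mu = better, by assumption).
    """
    indices = list(range(len(mu)))
    if policy == "sigma":
        # Regenerate where the ensemble disagrees most.
        ranked = sorted(indices, key=lambda i: sigma[i], reverse=True)
    elif policy == "mean":
        # Regenerate where the selected candidate looks worst on average.
        ranked = sorted(indices, key=lambda i: mu[i], reverse=True)
    elif policy == "random":
        # Control: same budget, no signal.
        ranked = indices[:]
        random.Random(seed).shuffle(ranked)
    else:
        raise ValueError(f"unknown policy: {policy}")
    return set(ranked[:budget])

# Invented first-pass statistics for four problems.
mu = [-3.0, -0.5, -2.0, -1.0]
sigma = [0.1, 0.2, 0.9, 0.8]
plans = {p: choose_regeneration_set(mu, sigma, budget=2, policy=p)
         for p in ("sigma", "mean", "random")}
```

Running the same second pass on each plan and comparing final accuracy and violation rates would separate the effect of the uncertainty signal from the effect of extra generations.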

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for their constructive and detailed feedback. The comments highlight important aspects of experimental rigor and methodological clarity that will strengthen the manuscript. We address each major comment below and describe the revisions we plan to incorporate.

Point-by-point responses
  1. Referee: The headline gains on MuSR and TravelPlanner are reported only for the full two-pass system. No ablation is described that holds the energy function, candidate pool size, and total inference budget fixed while removing the ensemble standard deviation trigger (e.g., regenerating top-k unconditionally or based solely on mean score). This leaves open whether observed improvements stem from epistemic uncertainty quantification or simply from extra generations, which is load-bearing for the paper's central claim about uncertainty-aware reasoning.

    Authors: We agree that an explicit ablation isolating the contribution of the ensemble standard deviation is necessary to substantiate the central claim regarding uncertainty-aware reasoning. While the current experiments demonstrate the performance of the full two-pass system, we will add a controlled ablation in the revised manuscript. This ablation will hold the energy function, candidate pool size, and total inference budget fixed, comparing regeneration triggered by the standard deviation against variants that regenerate based solely on the mean score or unconditionally. The results will clarify whether the observed gains derive from epistemic uncertainty quantification beyond additional generations. revision: yes

  2. Referee: The exact formulation of the decomposed energy function—how the ensemble mean and standard deviation are computed from the heterogeneous LoRAs and how they are weighted against the deterministic penalties—is not provided in sufficient detail. This affects assessment of reproducibility and whether the mean-std separation is doing independent work beyond the constraint penalties.

    Authors: We acknowledge that the precise mathematical formulation of the decomposed energy function requires additional detail for reproducibility. In the revised manuscript, we will expand Section 3 to include the exact equations: specifically, how the ensemble mean and standard deviation are computed across the heterogeneous low-rank adapters on the frozen encoder, and the weighting scheme that combines these statistics with the deterministic analytical constraint penalties in the overall energy function. This will allow readers to evaluate the independent contributions of the mean-std separation. revision: yes
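For orientation, one conventional way to write the requested statistics is given below. This is a generic ensemble formulation under assumed notation, not the paper's equations: E_k is the k-th adapter's energy as in the Figure 2 caption, C_j are the deterministic constraint penalties, and the weights λ_j and the additive form are hypothetical.

```latex
\mu_{\text{quality}}(x, y) = \frac{1}{K} \sum_{k=1}^{K} E_k(x, y),
\qquad
\sigma_{\text{quality}}(x, y) =
  \sqrt{\frac{1}{K} \sum_{k=1}^{K} \bigl( E_k(x, y) - \mu_{\text{quality}}(x, y) \bigr)^{2}}

E(x, y) = \mu_{\text{quality}}(x, y) + \sum_{j} \lambda_j \, C_j(x, y),
\qquad
\hat{y} = \arg\min_{y \in \{y_1, \dots, y_N\}} E(x, y)
```

Under this reading, σ_quality does not enter the ranking at all; it only gates the second pass, which is exactly the separation the referee asks the authors to make explicit.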

Circularity Check

0 steps flagged

No significant circularity; empirical claims grounded in cross-dataset transfer and external benchmarks

full rationale

The paper's core contribution is an empirical verifier system (ensemble of low-rank adapters plus deterministic penalties) evaluated on five benchmarks with explicit zero-shot transfer (MuSR-trained scorer on GSM8K) and cross-dataset confounding analysis. These elements supply independent grounding outside any single fitted set. No equations reduce a claimed prediction to its own inputs by construction, no load-bearing self-citation chains appear, and the two-pass loop is presented as an engineering choice rather than a derived necessity. The results are therefore self-contained against external benchmarks.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Abstract-only review; no explicit free parameters, axioms, or invented entities are stated beyond standard ML training assumptions and the existence of checkable constraints.

pith-pipeline@v0.9.0 · 5844 in / 1194 out tokens · 29300 ms · 2026-05-20T20:42:13.545833+00:00 · methodology


Reference graph

Works this paper leans on

72 extracted references · 72 canonical work pages · 5 internal anchors

  1. [1] A Hitchhiker's Guide to the Relation of Energy-Based Models with Other Generative Models, Sampling and Statistical Physics. Transactions on Machine Learning Research.
  2. [2] Autoregressive Language Models Are Secretly Energy-Based Models: Insights into the Lookahead Capabilities of Next-Token Prediction. 2025.
  3. [3] A Theoretical Lens for RL-Tuned Language Models via Energy-Based Models. arXiv:2512.18730.
  4. [4] Jiang, Eric Hanchen and others. Learning to [...]
  5. [5] Chen, Zhikang and Cui, Sen and Ye, Deheng and Zhang, Yu and Bian, Yatao and Zhu, Tingting. Think [...]
  6. [6] Chain-of-thought prompting elicits reasoning in large language models. Advances in Neural Information Processing Systems.
  7. [7] Self-Consistency Improves Chain of Thought Reasoning in Language Models. The Eleventh International Conference on Learning Representations.
  8. [8] Let's Verify Step by Step. 2023.
  9. [9] Khalifa, Muhammad and others. Process [...]
  10. [10] On Memorization of Large Language Models in Logical Reasoning. IJCNLP-AACL.
  11. [11] TravelPlanner: A Benchmark for Real-World Planning with Language Agents. International Conference on Machine Learning.
  12. [12] Lin, Bill Yuchen and others. 2025.
  13. [13] Training Verifiers to Solve Math Word Problems. 2021. URL https://arxiv.org/abs/2110.14168
  14. [14] Lee, Kuang-Huei and Fischer, Ian and Wu, Yueh-Hua and Marwood, Dave and Baluja, Shumeet and Schuurmans, Dale and Chen, Xinyun. Evolving Deeper [...]
  15. [15] Charlie Victor Snell and Jaehoon Lee and Kelvin Xu and Aviral Kumar. Scaling [...]. 2025.
  16. [16] Large Language Monkeys: Scaling Inference Compute with Repeated Sampling. 2024.
  17. [17] Why We Think. 2025.
  18. [18] Xia, Yu and Fan, Jingru and Chen, Weize and Yan, Siyu and Cong, Xin and Zhang, Zhong and Lu, Yaxi and Lin, Yankai and Liu, Zhiyuan and Sun, Maosong. AgentRM: Enhancing Agent Generalization with Reward Modeling. Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers). 2025. doi:10.18653/v1/2025.acl...
  19. [19] Uncertainty Estimation for Language Reward Models. 2022.
  20. [20] Deep Ensembles: A Loss Landscape Perspective. 2019.
  21. [21] Lakshminarayanan, Balaji and Pritzel, Alexander and Blundell, Charles. Simple and [...]. 2017.
  22. [22] Pengcheng He and Jianfeng Gao and Weizhu Chen. DeBERTaV3: Improving DeBERTa using ELECTRA-Style Pre-Training with Gradient-Disentangled Embedding Sharing. arXiv:2111.09543.
  23. [23] Qwen2.5 Technical Report. 2024.
  24. [24] Bradley, Ralph Allan and Terry, Milton E. Rank Analysis of Incomplete Block Designs: [...]. 1952.
  25. [25] Language models are few-shot learners. Advances in Neural Information Processing Systems.
  26. [26] OpenAI. 2023.
  27. [27] Least-to-Most Prompting Enables Complex Reasoning in Large Language Models. The Eleventh International Conference on Learning Representations.
  28. [28] Tree of Thoughts: Deliberate Problem Solving with Large Language Models. Thirty-seventh Conference on Neural Information Processing Systems.
  29. [29] Sprague, Zayne and Ye, Xi and Bostrom, Kaj and Chaudhuri, Swarat and Durrett, Greg. MuSR: Testing the Limits of Chain-of-thought with Multistep Soft Reasoning.
  30. [30] Li, Rongao and Fu, Jie and Zhang, Bo-Wen and Huang, Tao and Sun, Zhihong and Lyu, Chen and Liu, Guang and Jin, Zhi and Li, Ge.
  31. [31] Introducing gpt-oss. 2025.
  32. [32] Smarter, better, faster, longer: A modern bidirectional encoder for fast, memory efficient, and long context finetuning and inference. Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers).
  33. [33] The Llama 3 Herd of Models. arXiv:2407.21783.
  34. [34] Gemma 4 Technical Report.
  35. [35] The Claude Model Family: Claude Opus 4.6, Claude Sonnet 4.6. 2026.
  36. [36] Shortcut Learning in Deep Neural Networks. Nature Machine Intelligence. 2020.
  37. [37] Decoupled Weight Decay Regularization. International Conference on Learning Representations (ICLR).
  38. [38] Training Language Models to Follow Instructions with Human Feedback. Advances in Neural Information Processing Systems (NeurIPS).
  39. [39] Solving math word problems with process- and outcome-based feedback. ArXiv.
  40. [40] Guo, Daya and Yang, Dejian and Zhang, Haowei and Song, Junxiao and Wang, Peiyi and Zhu, Qihao and Xu, Runxin and Zhang, Ruoyu and Ma, Shirong and Bi, Xiao and others.
  41. [41] Houlsby, Neil and Giurgiu, Andrei and Jastrzebski, Stanislaw and Morrone, Bruna and de Laroussilhe, Quentin and Gesmundo, Andrea and Attariyan, Mona and Gelly, Sylvain. Parameter-Efficient Transfer Learning for [...]. 2019.
  42. [42] Stiennon, Nisan and Ouyang, Long and Wu, Jeff and Ziegler, Daniel M. and Lowe, Ryan and Voss, Chelsea and Radford, Alec and Amodei, Dario and Christiano, Paul. Proceedings of the 34th International Conference on Neural Information Processing Systems. 2020.
  43. [43] Proving Test Set Contamination in Black Box Language Models. International Conference on Learning Representations (ICLR).
  44. [44] Hu, Edward J. and Shen, Yelong and Wallis, Phillip and Allen-Zhu, Zeyuan and Li, Yuanzhi and Wang, Shean and Wang, Lu and Chen, Weizhu. 2022.
  45. [45] Dan Hendrycks and Collin Burns and Saurav Kadavath and Akul Arora and Steven Basart and Eric Tang and Dawn Song and Jacob Steinhardt. Measuring Mathematical Problem Solving With the MATH Dataset. arXiv:2103.03874.
  46. [46] Epistemic Deep Learning. arXiv:2206.07609.
  47. [47] Manchingal, Shireen Kudukkil and Mubashar, Muhammad and Wang, Kaizheng and Shariatmadar, Keivan and Cuzzolin, Fabio. Random-Set Neural Networks ([...]). 2025.
  48. [48] Kaushik, Divyansh and Hovy, Eduard and Lipton, Zachary C. Learning the [...]. 2020.
  49. [49] Kirichenko, Polina and Izmailov, Pavel and Wilson, Andrew Gordon. Last [...]. 2023.
  50. [50] Rethinking Benchmark and Contamination for Language Models with Rephrased Samples. ArXiv.
  51. [51] Kojima, Takeshi and Gu, Shixiang Shane and Reid, Machel and Matsuo, Yutaka and Iwasawa, Yusuke. Large [...]. 2022.
  52. [52] Christiano, Paul F and Leike, Jan and Brown, Tom and Martic, Miljan and Legg, Shane and Amodei, Dario. Deep Reinforcement Learning from Human Preferences.
  53. [53] Direct Preference Optimization: Your Language Model is Secretly a Reward Model. Advances in Neural Information Processing Systems (NeurIPS).
  54. [54] Gal, Yarin and Ghahramani, Zoubin. Dropout as a [...]. 2016.
  55. [55] Self-Refine: Iterative Refinement with Self-Feedback. Advances in Neural Information Processing Systems (NeurIPS).
  56. [56] Distributionally Robust Neural Networks for Group Shifts: On the Importance of Regularization for Worst-Case Generalization. International Conference on Learning Representations (ICLR).
  57. [57] Residual energy-based models for text generation. International Conference on Learning Representations (ICLR).
  58. [58] Mixture-of-Agents Enhances Large Language Model Capabilities. The Thirteenth International Conference on Learning Representations.
  59. [59] Attention Is All You Need. Advances in Neural Information Processing Systems (NeurIPS).
  60. [60] Devlin, Jacob and Chang, Ming-Wei and Lee, Kenton and Toutanova, Kristina. 2019.
  61. [61] Gaussian Error Linear Units (GELUs). arXiv: Learning.
  62. [62] Layer Normalization. ArXiv.
  63. [63] Loshchilov, Ilya and Hutter, Frank. 2017.
  64. [64] Holtzman, Ari and Buys, Jan and Du, Li and Forbes, Maxwell and Choi, Yejin. The [...]. 2020.
  65. [65] On calibration of modern neural networks. International Conference on Machine Learning. 2017.
  66. [66] Shortcut learning in deep neural networks. Nature Machine Intelligence. 2020.
  67. [67] Evaluating Large Language Models Trained on Code. arXiv:2107.03374.
  68. [68] Bagging predictors. Machine Learning. 1996.
  69. [69] LeCun, Yann and Chopra, Sumit and Hadsell, Raia and Ranzato, Marc'Aurelio and Huang, Fu Jie. A [...]. 2006.
  70. [70] Liu, Chris Yuhao and Zeng, Liang and Liu, Jiacai and Yan, Rui and He, Jujie and Wang, Chaojie and Yan, Shuicheng and Liu, Yang and Zhou, Yahui. Skywork-Reward: Bag of Tricks for Reward Modeling in [...]
  71. [71] Wang, Peiyi and Li, Lei and Shao, Zhihong and Xu, Runxin and Dai, Damai and Li, Yifei and Chen, Deli and Wu, Yu and Sui, Zhifang. Math-Shepherd: Verify and Reinforce [...]
  72. [72] Zheng, Lianmin and Chiang, Wei-Lin and Sheng, Ying and Zhuang, Siyuan and Wu, Zhanghao and Zhuang, Yonghao and Lin, Zi and Li, Zhuohan and Li, Dacheng and Xing, Eric P. and Zhang, Hao and Gonzalez, Joseph E. and Stoica, Ion. Judging [...]