Don't Pass@k: A Bayesian Framework for Large Language Model Evaluation
Pith reviewed 2026-05-18 10:00 UTC · model grok-4.3
The pith
Bayesian posterior estimates of success probability replace Pass@k to yield stable LLM rankings with explicit uncertainty at small sample sizes.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
Evaluation outcomes are modeled as categorical with a Dirichlet prior, giving closed-form expressions for the posterior mean and uncertainty of any weighted rubric. Theoretically, under a uniform prior, the Bayesian posterior mean is order-equivalent to average accuracy (Pass@1). Empirically, the posterior-based procedure achieves faster convergence and greater rank stability than Pass@k and recent variants on simulations with known ground-truth success rates and on AIME'24/'25, HMMT'25, and BrUMO'25, enabling reliable comparisons at far smaller sample counts. The framework clarifies when observed gaps are statistically meaningful via non-overlapping credible intervals and naturally extends to graded, rubric-based evaluations.
What carries the argument
The Dirichlet-Multinomial posterior distribution over categorical success probabilities, which supplies closed-form expressions for the mean and credible intervals of any weighted rubric under the Bayesian model.
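The closed forms referred to here can be sketched in a few lines. This is a minimal illustration, not the authors' implementation (their code is in the scorio repository). The function name `rubric_posterior`, the argument layout, and the pooling of all trials into one count vector are assumptions made for this example. It relies only on the standard moments of a Dirichlet distribution: with prior pseudo-counts alpha and observed counts n, the posterior is Dirichlet(alpha + n), and the mean and variance of a weighted score follow directly.

```python
from math import sqrt

def rubric_posterior(counts, weights, prior=None):
    """Posterior mean and standard deviation of a weighted rubric score.

    counts[k]  : number of trials graded into category k
    weights[k] : score attached to category k (hypothetical rubric)
    prior[k]   : Dirichlet pseudo-count for category k (uniform if omitted)
    """
    if prior is None:
        prior = [1.0] * len(counts)
    # Conjugate update: posterior pseudo-counts are prior + observed counts.
    post = [a + n for a, n in zip(prior, counts)]
    total = sum(post)
    means = [x / total for x in post]
    # Mean of sum_k w_k p_k under the posterior.
    mean = sum(w * m for w, m in zip(weights, means))
    # Variance of a linear function of a Dirichlet vector.
    second = sum(w * w * m for w, m in zip(weights, means))
    var = (second - mean * mean) / (total + 1.0)
    return mean, sqrt(max(var, 0.0))
```

For binary grading with weights `[0.0, 1.0]`, this reduces to the Beta posterior: `rubric_posterior([3, 7], [0.0, 1.0])` (3 failures, 7 successes, uniform prior) has posterior mean 8/12.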
If this is right
- Reliable model comparisons become possible with far smaller numbers of samples than currently required by Pass@k.
- Non-overlapping credible intervals serve as a transparent rule for declaring performance differences statistically meaningful.
- The same framework applies directly to both binary correctness and graded or rubric-scored evaluations.
- Prior evidence from previous evaluations can be incorporated through the choice of Dirichlet parameters.
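The interval rule and the prior-evidence option in the list above can be illustrated for the binary case. The sketch below is an example under stated assumptions, not the paper's procedure: it estimates equal-tailed intervals by Monte Carlo sampling from a Beta posterior using only Python's standard library, and the names `credible_interval` and `clearly_different`, the 95% level, and the pseudo-count form of the prior are hypothetical choices.

```python
import random

def credible_interval(successes, failures, prior=(1.0, 1.0), level=0.95,
                      draws=20000, seed=0):
    """Equal-tailed credible interval for a success probability.

    prior holds Beta pseudo-counts (a, b); (1, 1) is uniform. Evidence from
    an earlier evaluation could be encoded as larger pseudo-counts.
    The interval is a Monte Carlo estimate, not an exact quantile.
    """
    rng = random.Random(seed)
    a = prior[0] + successes
    b = prior[1] + failures
    samples = sorted(rng.betavariate(a, b) for _ in range(draws))
    tail = (1.0 - level) / 2.0
    lo = samples[int(tail * draws)]
    hi = samples[int((1.0 - tail) * draws) - 1]
    return lo, hi

def clearly_different(interval_a, interval_b):
    """True when the two intervals do not overlap."""
    return interval_a[1] < interval_b[0] or interval_b[1] < interval_a[0]
```

A comparison would then read `clearly_different(credible_interval(18, 2), credible_interval(6, 14))`. Non-overlap is a conservative criterion, so overlapping intervals do not by themselves show that two models perform equally.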
Where Pith is reading between the lines
- This protocol could lower the computational cost of large-scale LLM benchmarking by reducing the number of required model calls per evaluation.
- Hierarchical Bayesian extensions might further improve estimates by sharing statistical strength across related tasks or model families.
- The same treatment of stochastic outcomes could be applied to evaluation in other domains such as reinforcement learning or automated theorem proving.
Load-bearing premise
Each model's performance on a given task can be summarized by a single fixed but unknown success probability from which trials are independent draws.
What would settle it
A simulation or benchmark experiment with known ground-truth success rates in which the posterior-based procedure does not converge faster or produce more stable ranks than Pass@k at the same sample counts.
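A test of this kind could be set up roughly as follows. The sketch is hypothetical and much simpler than the paper's experiments: per-question success probabilities are supplied by the caller as ground truth, every question receives the same number of trials, and the two scores are summed over questions. The Pass@k function uses the unbiased estimator 1 - C(n-c, k)/C(n, k) from the Codex paper (Chen et al., 2021); the names `simulate`, `posterior_mean`, and `pass_at_k` are chosen here for illustration.

```python
import random
from math import comb

def pass_at_k(n, c, k):
    """Unbiased Pass@k estimate from n trials with c successes."""
    if n - c < k:
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)

def posterior_mean(n, c):
    """Uniform-prior Beta posterior mean of the success probability."""
    return (c + 1) / (n + 2)

def simulate(true_p_a, true_p_b, n_trials, k, n_reps, seed=0):
    """Fraction of repetitions in which each score ranks model A above B.

    true_p_a and true_p_b are lists of per-question success probabilities.
    Both scores are computed from the same simulated trials.
    """
    rng = random.Random(seed)
    wins = {"posterior": 0, "pass_at_k": 0}
    for _ in range(n_reps):
        scores = {"posterior": [0.0, 0.0], "pass_at_k": [0.0, 0.0]}
        for m, probs in enumerate((true_p_a, true_p_b)):
            for p in probs:
                c = sum(rng.random() < p for _ in range(n_trials))
                scores["posterior"][m] += posterior_mean(n_trials, c)
                scores["pass_at_k"][m] += pass_at_k(n_trials, c, k)
        for name, (a, b) in scores.items():
            wins[name] += a > b
    return {name: w / n_reps for name, w in wins.items()}
```

Calling `simulate([0.6] * 30, [0.5] * 30, n_trials=8, k=4, n_reps=500)` returns, for each score, the fraction of repetitions in which the truly stronger model A is ranked first. No particular outcome is asserted here; the point is only that the claim is checkable at a fixed sample count.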
Original abstract
Pass$@k$ is widely used to report the reasoning performance of LLMs, but it often produces unstable and potentially misleading rankings, especially when the number of trials (samples) is limited and computational resources are constrained. We present a principled Bayesian evaluation framework that replaces Pass$@k$ and average accuracy over $N$ trials (avg$@N$) with posterior estimates of a model's underlying success probability and credible intervals, yielding stable rankings and a transparent decision rule for differences. Evaluation outcomes are modeled as categorical (not just 0/1) with a Dirichlet prior, giving closed-form expressions for the posterior mean and uncertainty of any weighted rubric and enabling the use of prior evidence when appropriate. Theoretically, under a uniform prior, the Bayesian posterior mean is order-equivalent to average accuracy (Pass$@1$), explaining its empirical robustness while adding principled uncertainty. Empirically, in simulations with known ground-truth success rates and on AIME'24/'25, HMMT'25, and BrUMO'25, the posterior-based procedure achieves faster convergence and greater rank stability than Pass$@k$ and recent variants, enabling reliable comparisons at far smaller sample counts. The framework clarifies when observed gaps are statistically meaningful (non-overlapping credible intervals) versus noise, and it naturally extends to graded, rubric-based evaluations. Together, these results recommend replacing Pass$@k$ for LLM evaluation and ranking with a posterior-based, compute-efficient protocol that unifies binary and non-binary evaluation while making uncertainty explicit. Source code is available at https://github.com/mohsenhariri/scorio
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper proposes a Bayesian evaluation framework for LLMs that models outcomes as categorical draws under a Dirichlet prior, replacing Pass@k and avg@N with posterior means and credible intervals. It claims that under a uniform prior the posterior mean is order-equivalent to average accuracy, and that the approach yields faster convergence, greater rank stability, and clearer significance tests than Pass@k or variants on both synthetic simulations with known ground-truth rates and real math-competition benchmarks (AIME'24/'25, HMMT'25, BrUMO'25).
Significance. If the central claims hold, the framework would offer a principled, compute-efficient alternative to current LLM benchmarking practice, enabling reliable model comparisons at substantially smaller sample sizes while making uncertainty explicit and extending naturally to rubric-based scoring.
major comments (2)
- [Model and Empirical Evaluation] The Dirichlet-Multinomial model (abstract and model section) assumes i.i.d. trials from a single fixed success-probability vector per model. Competition problems exhibit heterogeneous difficulty, so observed outcomes are more plausibly a mixture of Bernoullis; under this misspecification the posterior no longer correctly calibrates uncertainty and the reported gains in convergence and rank stability may be artifacts of the i.i.d. simulation regime used to generate the ground-truth comparisons.
- [Empirical Evaluation] The empirical claims of faster convergence and rank stability rest on simulations generated under the same fixed-p i.i.d. regime that the model assumes (abstract). No robustness checks or alternative generative processes (e.g., difficulty-varying mixtures) are reported, leaving open whether the advantage persists on data that violate the modeling assumption.
minor comments (1)
- [Abstract] The abstract states that source code is available but does not specify the exact sample sizes, number of models, or statistical tests used to quantify 'faster convergence' and 'greater rank stability' on the named benchmarks.
Simulated Author's Rebuttal
We thank the referee for the detailed and constructive comments on our manuscript. We address each major comment below, clarifying the modeling assumptions, the role of simulations versus real benchmarks, and our plans for revisions.
-
Referee: The Dirichlet-Multinomial model (abstract and model section) assumes i.i.d. trials from a single fixed success-probability vector per model. Competition problems exhibit heterogeneous difficulty, so observed outcomes are more plausibly a mixture of Bernoullis; under this misspecification the posterior no longer correctly calibrates uncertainty and the reported gains in convergence and rank stability may be artifacts of the i.i.d. simulation regime used to generate the ground-truth comparisons.
Authors: We agree that real competition problems have heterogeneous difficulties, implying that observed successes arise from a mixture of Bernoulli distributions rather than i.i.d. draws from a single fixed success probability. Our framework is designed to estimate a model's expected success rate under the distribution of problems encountered in evaluation, with the Dirichlet prior providing regularization that improves estimate stability compared to raw Pass@k or avg@N. While the i.i.d. assumption is an approximation, the posterior mean remains order-equivalent to average accuracy under a uniform prior, and the credible intervals offer a principled way to assess differences. Importantly, the empirical results on AIME'24/'25, HMMT'25, and BrUMO'25 already reflect heterogeneous problem difficulties, and the observed improvements in rank stability and convergence there support the practical utility of the approach beyond the simulation regime. We will add a dedicated discussion subsection on modeling assumptions, potential misspecification effects on uncertainty calibration, and why the method remains useful for ranking even under heterogeneity. revision: partial
-
Referee: The empirical claims of faster convergence and rank stability rest on simulations generated under the same fixed-p i.i.d. regime that the model assumes (abstract). No robustness checks or alternative generative processes (e.g., difficulty-varying mixtures) are reported, leaving open whether the advantage persists on data that violate the modeling assumption.
Authors: We acknowledge that the primary simulation experiments use a fixed-p i.i.d. generative process to enable exact ground-truth comparisons for convergence analysis. However, the real-benchmark evaluations on AIME, HMMT, and BrUMO inherently involve varying problem difficulties and thus serve as a partial robustness check. To directly address the concern, we will add new simulation experiments in the revised manuscript that generate data from heterogeneous difficulty models (e.g., success probabilities drawn from a Beta distribution or a finite mixture of Bernoullis per problem). These will compare convergence rates, rank stability, and credible-interval coverage against Pass@k and avg@N under misspecification, allowing us to quantify whether the advantages persist. revision: yes
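The kind of misspecification check promised in this response can be sketched as follows. This is an assumption-laden illustration, not the authors' planned experiment: per-problem success probabilities are drawn from a Beta(a, b) distribution, all trials are pooled into a single uniform-prior Beta posterior, and the function reports how often a 95% Monte Carlo interval covers the realized mean success probability. The function name and all default values are hypothetical.

```python
import random

def coverage_under_heterogeneity(n_problems=30, n_trials=8, a=2.0, b=2.0,
                                 n_reps=200, draws=2000, seed=0):
    """Share of repetitions whose pooled 95% interval covers the realized mean.

    Each problem gets its own success probability drawn from Beta(a, b), so
    the data violate the single-fixed-p assumption. The interval comes from
    a uniform-prior Beta posterior on the pooled counts.
    """
    rng = random.Random(seed)
    covered = 0
    for _ in range(n_reps):
        probs = [rng.betavariate(a, b) for _ in range(n_problems)]
        target = sum(probs) / n_problems
        successes = sum(rng.random() < p for p in probs for _ in range(n_trials))
        failures = n_problems * n_trials - successes
        samples = sorted(rng.betavariate(1 + successes, 1 + failures)
                         for _ in range(draws))
        lo = samples[int(0.025 * draws)]
        hi = samples[int(0.975 * draws) - 1]
        covered += lo <= target <= hi
    return covered / n_reps
```

Comparing the returned share with the nominal 0.95 gives a rough calibration check. A fuller study would also vary the target (for example, the population mean a / (a + b) rather than the realized mean) and track rank stability against Pass@k and avg@N, as the response describes.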
Circularity Check
No significant circularity detected
The paper's core derivation applies standard Dirichlet-Multinomial conjugacy to obtain closed-form posterior means and credible intervals from a uniform prior. The stated order-equivalence between the uniform-prior posterior mean and average accuracy (Pass@1) is a direct algebraic consequence of the Beta or Dirichlet update formulas and does not reduce any ranking result to a fitted parameter or self-referential definition. Empirical comparisons on simulations (with known ground-truth rates) and real benchmarks (AIME, HMMT, BrUMO) are presented as external validation rather than by construction. No load-bearing self-citations, uniqueness theorems imported from prior author work, or ansatzes smuggled via citation appear in the derivation chain. The framework remains self-contained against external statistical benchmarks.
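The order-equivalence can be checked by hand in the binary special case. The derivation below is a sketch under stated assumptions: binary outcomes, a Beta(1, 1) prior, and the same number of trials N for every model. The symbols s (number of successes) and \hat{p} = s/N (average accuracy) are introduced here and are not taken from the paper.

```latex
\begin{align*}
p \mid s &\sim \mathrm{Beta}(1 + s,\; 1 + N - s), \\
\mathbb{E}[p \mid s] &= \frac{s + 1}{N + 2}
  = \frac{N}{N + 2}\,\hat{p} + \frac{1}{N + 2}.
\end{align*}
```

Because N/(N + 2) > 0, the posterior mean is an increasing affine function of \hat{p}, so two models evaluated with the same N are ordered identically by either quantity.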
Axiom & Free-Parameter Ledger
axioms (1)
- domain assumption: Evaluation outcomes are i.i.d. draws from a categorical distribution whose probability vector is fixed for a given model and task.
Forward citations
Cited by 1 Pith paper
-
CUDABeaver: Benchmarking LLM-Based Automated CUDA Debugging
CUDABeaver shows LLM CUDA debuggers often degenerate code for test-passing at the cost of speed, with protocol-aware metrics shifting success rates by up to 40 percentage points.