
arxiv: 2510.04265 · v4 · submitted 2025-10-05 · 💻 cs.AI · cs.CL · math.ST · stat.ML · stat.TH

Don't Pass@k: A Bayesian Framework for Large Language Model Evaluation

Pith reviewed 2026-05-18 10:00 UTC · model grok-4.3

keywords: Bayesian evaluation · LLM benchmarking · Pass@k · Dirichlet prior · rank stability · credible intervals · model ranking

The pith

Bayesian posterior estimates of success probability replace Pass@k to yield stable LLM rankings with explicit uncertainty at small sample sizes.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper contends that Pass@k and similar metrics produce unstable rankings of large language models at small sample sizes, because their estimates have high variance. It develops a Bayesian framework that treats each trial outcome as a categorical variable drawn from an unknown probability distribution equipped with a Dirichlet prior. Closed-form updates then give the posterior mean of the success probability and credible intervals that quantify its uncertainty. In simulations with known ground truth and on recent math competition benchmarks, this yields faster convergence to the correct model ordering and more stable ranks than Pass@k. The approach also supplies a direct way to decide whether differences between models are meaningful, and it extends straightforwardly from binary correctness to rubric-based scoring.

Core claim

Evaluation outcomes are modeled as categorical with a Dirichlet prior, giving closed-form expressions for the posterior mean and uncertainty of any weighted rubric. Theoretically, under a uniform prior, the Bayesian posterior mean is order-equivalent to average accuracy (Pass@1). Empirically, the posterior-based procedure achieves faster convergence and greater rank stability than Pass@k and recent variants on simulations with known ground-truth success rates and on AIME'24/'25, HMMT'25, and BrUMO'25, enabling reliable comparisons at far smaller sample counts. The framework clarifies when observed gaps are statistically meaningful via non-overlapping credible intervals and naturally extends to graded, rubric-based evaluations.
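For the binary case, the order-equivalence claim can be checked in one line. The sketch below assumes a Beta(1, 1) prior (the two-category uniform Dirichlet), $s$ successes out of $N$ trials, and the same $N$ for every model being compared; the symbols $s$, $N$, and $\hat{p}$ are notation introduced here and may differ from the paper's.

```latex
\[
\hat{p} \;=\; \mathbb{E}[\,p \mid s, N\,] \;=\; \frac{s+1}{N+2},
\qquad
\hat{p}_A - \hat{p}_B \;=\; \frac{s_A - s_B}{N+2}
\;=\; \frac{N}{N+2}\left(\frac{s_A}{N} - \frac{s_B}{N}\right).
\]
```

Since $N/(N+2) > 0$, the sign of the posterior-mean gap always matches the sign of the average-accuracy gap, so both induce the same ranking. The argument relies on equal $N$ across models; with unequal trial counts the two orderings can differ.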

What carries the argument

The Dirichlet-Multinomial posterior distribution over categorical success probabilities, which supplies closed-form expressions for the mean and credible intervals of any weighted rubric under the Bayesian model.
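As an illustration of what "closed-form" means here, the following is a minimal sketch of the posterior mean and standard deviation of a weighted rubric score under a Dirichlet posterior. It uses only the standard Dirichlet moment formulas; the function name `rubric_posterior`, the argument layout, and the default uniform prior are assumptions made for this example and are not taken from the paper or its code.

```python
from math import sqrt


def rubric_posterior(counts, weights, prior=None):
    """Posterior mean and std of S = sum_c w_c * theta_c.

    theta ~ Dirichlet(prior + counts), where counts[c] is how often
    category c was observed and weights[c] is its rubric score.
    """
    if prior is None:
        # Assumption for this sketch: uniform Dirichlet prior (all ones).
        prior = [1.0] * len(counts)
    if not (len(counts) == len(weights) == len(prior)):
        raise ValueError("counts, weights and prior must have equal length")

    post = [a + n for a, n in zip(prior, counts)]  # conjugate update
    total = sum(post)

    # E[S] = sum_c w_c * a_c / A
    mean = sum(w * a for w, a in zip(weights, post)) / total
    # Var[S] = (sum_c w_c^2 a_c / A - E[S]^2) / (A + 1)
    second = sum(w * w * a for w, a in zip(weights, post)) / total
    var = (second - mean * mean) / (total + 1.0)
    return mean, sqrt(max(var, 0.0))


# Hypothetical three-level rubric: wrong (0), partially right (0.5), right (1).
mean, std = rubric_posterior(counts=[2, 3, 5], weights=[0.0, 0.5, 1.0])
print(f"posterior mean = {mean:.4f}, posterior std = {std:.4f}")
```

With two categories and weights `[0, 1]` this reduces to the familiar Beta posterior for a success probability, which is a quick sanity check on the formulas.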

If this is right

  • Reliable model comparisons become possible with far smaller numbers of samples than currently required by Pass@k.
  • Non-overlapping credible intervals serve as a transparent rule for declaring performance differences statistically meaningful.
  • The same framework applies directly to both binary correctness and graded or rubric-scored evaluations.
  • Prior evidence from previous evaluations can be incorporated through the choice of Dirichlet parameters.
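The interval-overlap rule and the use of prior evidence in the list above can be sketched as follows for the binary case. This is a Monte Carlo approximation using only the Python standard library, not the paper's procedure: the helper names, the 95% level, and the sampling approach are all assumptions for illustration, and the paper's closed-form intervals may be computed differently.

```python
import random


def beta_posterior_samples(successes, trials, n_draws=20000,
                           prior=(1.0, 1.0), seed=0):
    """Sorted draws from the Beta posterior of a success probability.

    `prior` holds Beta pseudo-counts; earlier results can be folded in by
    passing e.g. prior=(1 + old_successes, 1 + old_failures).
    """
    rng = random.Random(seed)
    a = prior[0] + successes
    b = prior[1] + (trials - successes)
    return sorted(rng.betavariate(a, b) for _ in range(n_draws))


def credible_interval(sorted_samples, level=0.95):
    """Equal-tailed credible interval read off the sorted posterior draws."""
    n = len(sorted_samples)
    lo_idx = int((1 - level) / 2 * n)
    hi_idx = min(n - 1, int((1 + level) / 2 * n))
    return sorted_samples[lo_idx], sorted_samples[hi_idx]


def intervals_overlap(ci_a, ci_b):
    """True if the two closed intervals share at least one point."""
    return ci_a[0] <= ci_b[1] and ci_b[0] <= ci_a[1]


# Hypothetical counts: model A solves 60/80 attempts, model B solves 30/80.
ci_a = credible_interval(beta_posterior_samples(60, 80, seed=1))
ci_b = credible_interval(beta_posterior_samples(30, 80, seed=2))
print("A:", ci_a, "B:", ci_b, "overlap:", intervals_overlap(ci_a, ci_b))
```

Non-overlap is a conservative criterion: two intervals can overlap slightly even when the posterior probability that one model is better is high, so an overlap is best read as "not yet resolved" rather than "no difference".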

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • This protocol could lower the computational cost of large-scale LLM benchmarking by reducing the number of required model calls per evaluation.
  • Hierarchical Bayesian extensions might further improve estimates by sharing statistical strength across related tasks or model families.
  • The same treatment of stochastic outcomes could be applied to evaluation in other domains such as reinforcement learning or automated theorem proving.

Load-bearing premise

Each model's performance on a given task can be summarized by a single fixed but unknown success probability from which trials are independent draws.

What would settle it

A simulation or benchmark experiment with known ground-truth success rates in which the posterior-based procedure does not converge faster or produce more stable ranks than Pass@k at the same sample counts.
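A toy version of such an experiment might look like the sketch below. It is not the paper's setup: the per-question probabilities are random placeholders, Bayes@N is approximated by averaging per-question Beta(1, 1) posterior means, and rank agreement is measured with Kendall's tau-a, all of which are assumptions for illustration. The Pass@k function is the standard unbiased estimator of Chen et al. (2021).

```python
import random
from itertools import combinations
from math import comb


def pass_at_k(n, c, k):
    """Unbiased Pass@k estimate from c successes in n trials."""
    if n - c < k:
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)


def bayes_at_n(n, c):
    """Posterior mean of the success probability under a Beta(1, 1) prior."""
    return (c + 1) / (n + 2)


def kendall_tau_a(x, y):
    """Kendall's tau-a: (concordant - discordant) / number of pairs."""
    pairs = list(combinations(range(len(x)), 2))
    score = 0
    for i, j in pairs:
        prod = (x[i] - x[j]) * (y[i] - y[j])
        score += (prod > 0) - (prod < 0)  # tied pairs contribute 0
    return score / len(pairs)


def simulate(true_probs, n_trials, k, rng):
    """true_probs[m][q]: hypothetical success rate of model m on question q.

    Returns (tau for Bayes@N, tau for Pass@k) against the true ordering.
    """
    truth = [sum(p) / len(p) for p in true_probs]
    bayes_scores, pass_scores = [], []
    for probs in true_probs:
        counts = [sum(rng.random() < p for _ in range(n_trials)) for p in probs]
        bayes_scores.append(sum(bayes_at_n(n_trials, c) for c in counts) / len(counts))
        pass_scores.append(sum(pass_at_k(n_trials, c, k) for c in counts) / len(counts))
    return kendall_tau_a(truth, bayes_scores), kendall_tau_a(truth, pass_scores)


rng = random.Random(0)
# Placeholder ground truth: 5 synthetic models, 30 questions each.
true_probs = [[rng.random() for _ in range(30)] for _ in range(5)]
for n_trials in (4, 16, 64):
    tau_bayes, tau_pass = simulate(true_probs, n_trials, k=2, rng=rng)
    print(n_trials, round(tau_bayes, 3), round(tau_pass, 3))
```

A single run like this is noisy; a meaningful test would average tau over many replicates per trial count and compare the resulting curves.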

Figures

Figures reproduced from arXiv: 2510.04265 by Amirhossein Samandar, Michael Hinczewski, Mohsen Hariri, Vipin Chaudhary.

Figure 1. (a) Probability mass functions (PMFs) of convergence@…
Figure 2. Kendall’s τ rank correlation for various evaluation methods compared to the true ranking of 11 sets of biased coins (LLM mimics) with known mean success probabilities π̄ = 0.2332, 0.2545, 0.3604, 0.3642, 0.3642, 0.4466, 0.5418, 0.5276, 0.608, 0.6213, 0.7327. The simulation evaluates methods including Pass@k (k = 2, 4, 8), Bayes@N, naive Pass^k, G-Pass@k_τ̃ (τ̃ = 0.5), and mG-Pass@k across 1 to 80 trials. P…
Figure 3. (a) Histogram of Kendall τ values comparing the original ranking of synthetic LLM models and 50k replicates of updated models. (b) Mean Kendall τ between the estimated and true ranking for the updated models (50k replicates) as a function of N, the number of trials. The dashed line corresponds to estimates using Bayes@N with a uniform prior (D = 0), while the solid lines are Bayes@N with a non-uniform prior and d…
Figure 4. (a) Probability of correctly ranking LLM10 above LLM9 using Bayes@N in the biased-coin simulations, shown as a function of trial count N. The probability is 83.7% at N = 80, increases to ∼94.7% at N = 199, and reaches 96.9% at N = 285. (b) Corresponding absolute z-scores as a function of N, with values of ∼1.14 at N = 80, 1.645 at N = 199 (95% confidence), and 1.96 at N = 285 (97.5% confidence).
Figure 5. Average Kendall’s τ correlation between rankings produced by various evaluation methods and the gold standard (derived from Bayes@80, or equivalently avg@80), as a function of the number of trials N. Results are averaged over 10^4 bootstrapped resamples for each dataset: (a) AIME’25, (b) AIME’24, (c) HMMT’25, and (d) BrUMO’25. Methods include Bayesian estimation Bayes@N, Pass@k (k = 2, 4, 8), naive Pass^k,…
Figure 6. Worst-case rank trajectories. Each colored line tracks a model’s rank as trials are added (x-axis), across 10^5 bootstrap replications. Convergence is the minimal N after which the ranking remains unchanged. Top row (11 models): AIME’24 (N=75), AIME’25 (no convergence within 80), HMMT’25 (N=78), and BrUMO’25 (N=68). Bottom row (20 models): each benchmark has at least one no-convergence replicate within 80 t…
Figure 7. Convergence@n without CI. Mean convergence@n across model combinations for AIME’24, AIME’25, HMMT’25, and BrUMO’25. Top: 50 combinations of 5 models. Bottom-left: 20 combinations of 10 models. Bottom-right: 20 combinations of 15 models. Color indicates the mean convergence@n over 10^5 bootstrap replicates (green: fast convergence; red: slow convergence).
Figure 8. Sensitivity of model rankings to the categorical scoring schema. For each schema variant (x-axis; see …
Figure 9. Computational cost analysis. (Left) Total inference time in hours aggregated over 80 trials and 30 questions per benchmark (2,400 inference runs per cell). (Right) Total number of completion tokens (in thousands) generated across the same runs. Models are ordered by overall performance (best to worst, top to bottom). Task-level computational cost. HMMT’25 is the most expensive benchmark in terms of GPU tim…
Figure 10. CDF of convergence@n. Complementing the PMFs in …
Original abstract

Pass$@k$ is widely used to report the reasoning performance of LLMs, but it often produces unstable and potentially misleading rankings, especially when the number of trials (samples) is limited and computational resources are constrained. We present a principled Bayesian evaluation framework that replaces Pass$@k$ and average accuracy over $N$ trials (avg$@N$) with posterior estimates of a model's underlying success probability and credible intervals, yielding stable rankings and a transparent decision rule for differences. Evaluation outcomes are modeled as categorical (not just 0/1) with a Dirichlet prior, giving closed-form expressions for the posterior mean and uncertainty of any weighted rubric and enabling the use of prior evidence when appropriate. Theoretically, under a uniform prior, the Bayesian posterior mean is order-equivalent to average accuracy (Pass$@1$), explaining its empirical robustness while adding principled uncertainty. Empirically, in simulations with known ground-truth success rates and on AIME'24/'25, HMMT'25, and BrUMO'25, the posterior-based procedure achieves faster convergence and greater rank stability than Pass$@k$ and recent variants, enabling reliable comparisons at far smaller sample counts. The framework clarifies when observed gaps are statistically meaningful (non-overlapping credible intervals) versus noise, and it naturally extends to graded, rubric-based evaluations. Together, these results recommend replacing Pass$@k$ for LLM evaluation and ranking with a posterior-based, compute-efficient protocol that unifies binary and non-binary evaluation while making uncertainty explicit. Source code is available at https://github.com/mohsenhariri/scorio

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 1 minor

Summary. The paper proposes a Bayesian evaluation framework for LLMs that models outcomes as categorical draws under a Dirichlet prior, replacing Pass@k and avg@N with posterior means and credible intervals. It claims that under a uniform prior the posterior mean is order-equivalent to average accuracy, and that the approach yields faster convergence, greater rank stability, and clearer significance tests than Pass@k or variants on both synthetic simulations with known ground-truth rates and real math-competition benchmarks (AIME'24/'25, HMMT'25, BrUMO'25).

Significance. If the central claims hold, the framework would offer a principled, compute-efficient alternative to current LLM benchmarking practice, enabling reliable model comparisons at substantially smaller sample sizes while making uncertainty explicit and extending naturally to rubric-based scoring.

major comments (2)
  1. [Model and Empirical Evaluation] The Dirichlet-Multinomial model (abstract and model section) assumes i.i.d. trials from a single fixed success-probability vector per model. Competition problems exhibit heterogeneous difficulty, so observed outcomes are more plausibly a mixture of Bernoullis; under this misspecification the posterior no longer correctly calibrates uncertainty and the reported gains in convergence and rank stability may be artifacts of the i.i.d. simulation regime used to generate the ground-truth comparisons.
  2. [Empirical Evaluation] The empirical claims of faster convergence and rank stability rest on simulations generated under the same fixed-p i.i.d. regime that the model assumes (abstract). No robustness checks or alternative generative processes (e.g., difficulty-varying mixtures) are reported, leaving open whether the advantage persists on data that violate the modeling assumption.
minor comments (1)
  1. [Abstract] The abstract states that source code is available but does not specify the exact sample sizes, number of models, or statistical tests used to quantify 'faster convergence' and 'greater rank stability' on the named benchmarks.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the detailed and constructive comments on our manuscript. We address each major comment below, clarifying the modeling assumptions, the role of simulations versus real benchmarks, and our plans for revisions.

Point-by-point responses
  1. Referee: The Dirichlet-Multinomial model (abstract and model section) assumes i.i.d. trials from a single fixed success-probability vector per model. Competition problems exhibit heterogeneous difficulty, so observed outcomes are more plausibly a mixture of Bernoullis; under this misspecification the posterior no longer correctly calibrates uncertainty and the reported gains in convergence and rank stability may be artifacts of the i.i.d. simulation regime used to generate the ground-truth comparisons.

    Authors: We agree that real competition problems have heterogeneous difficulties, implying that observed successes arise from a mixture of Bernoulli distributions rather than i.i.d. draws from a single fixed success probability. Our framework is designed to estimate a model's expected success rate under the distribution of problems encountered in evaluation, with the Dirichlet prior providing regularization that improves estimate stability compared to raw Pass@k or avg@N. While the i.i.d. assumption is an approximation, the posterior mean remains order-equivalent to average accuracy under a uniform prior, and the credible intervals offer a principled way to assess differences. Importantly, the empirical results on AIME'24/'25, HMMT'25, and BrUMO'25 already reflect heterogeneous problem difficulties, and the observed improvements in rank stability and convergence there support the practical utility of the approach beyond the simulation regime. We will add a dedicated discussion subsection on modeling assumptions, potential misspecification effects on uncertainty calibration, and why the method remains useful for ranking even under heterogeneity. revision: partial

  2. Referee: The empirical claims of faster convergence and rank stability rest on simulations generated under the same fixed-p i.i.d. regime that the model assumes (abstract). No robustness checks or alternative generative processes (e.g., difficulty-varying mixtures) are reported, leaving open whether the advantage persists on data that violate the modeling assumption.

    Authors: We acknowledge that the primary simulation experiments use a fixed-p i.i.d. generative process to enable exact ground-truth comparisons for convergence analysis. However, the real-benchmark evaluations on AIME, HMMT, and BrUMO inherently involve varying problem difficulties and thus serve as a partial robustness check. To directly address the concern, we will add new simulation experiments in the revised manuscript that generate data from heterogeneous difficulty models (e.g., success probabilities drawn from a Beta distribution or a finite mixture of Bernoullis per problem). These will compare convergence rates, rank stability, and credible-interval coverage against Pass@k and avg@N under misspecification, allowing us to quantify whether the advantages persist. revision: yes
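One way the heterogeneous-difficulty check described above might be set up is sketched below. The Beta(0.5, 0.5) difficulty distribution, the problem and trial counts, and all function names are placeholders chosen for illustration; nothing here comes from the paper or the promised revision. The sketch also evaluates one exact consequence of heterogeneity for a fixed problem set: the variance of the total number of successes is sum_i p_i(1 - p_i), which never exceeds the pooled fixed-p value M * p_bar * (1 - p_bar).

```python
import random


def draw_difficulties(n_problems, a, b, rng):
    """Per-problem success probabilities drawn from Beta(a, b)."""
    return [rng.betavariate(a, b) for _ in range(n_problems)]


def simulate_outcomes(probs, n_trials, rng):
    """outcomes[i][t] is 1 if trial t on problem i succeeded, else 0."""
    return [[int(rng.random() < p) for _ in range(n_trials)] for p in probs]


def variance_of_total(probs):
    """Exact variance of total successes, one independent trial per problem."""
    return sum(p * (1.0 - p) for p in probs)


def pooled_variance_of_total(probs):
    """Variance if every problem shared the mean success probability."""
    m = len(probs)
    p_bar = sum(probs) / m
    return m * p_bar * (1.0 - p_bar)


rng = random.Random(0)
probs = draw_difficulties(n_problems=30, a=0.5, b=0.5, rng=rng)
outcomes = simulate_outcomes(probs, n_trials=80, rng=rng)

accuracy = sum(map(sum, outcomes)) / (len(probs) * 80)
print(f"observed accuracy        : {accuracy:.4f}")
print(f"heterogeneous variance   : {variance_of_total(probs):.4f}")
print(f"pooled (fixed-p) variance: {pooled_variance_of_total(probs):.4f}")
```

The gap between the two variances equals sum_i (p_i - p_bar)^2, so for a fixed problem set a pooled fixed-p model tends to overstate rather than understate sampling variability. This does not settle the referee's concern about interval coverage, which would need the full comparison against Pass@k and avg@N that the authors describe.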

Circularity Check

0 steps flagged

No significant circularity detected

Full rationale

The paper's core derivation applies standard Dirichlet-Multinomial conjugacy to obtain closed-form posterior means and credible intervals from a uniform prior. The stated order-equivalence between the uniform-prior posterior mean and average accuracy (Pass@1) is a direct algebraic consequence of the Beta or Dirichlet update formulas and does not reduce any ranking result to a fitted parameter or self-referential definition. Empirical comparisons on simulations (with known ground-truth rates) and real benchmarks (AIME, HMMT, BrUMO) are presented as external validation rather than by construction. No load-bearing self-citations, uniqueness theorems imported from prior author work, or ansatzes smuggled via citation appear in the derivation chain. The framework remains self-contained against external statistical benchmarks.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The framework rests on standard conjugate Bayesian updating for categorical data; no new free parameters are introduced beyond the choice of uniform Dirichlet prior, which is a conventional non-informative choice.

axioms (1)
  • domain assumption: Evaluation outcomes are i.i.d. draws from a categorical distribution whose probability vector is fixed for a given model and task.
    This modeling assumption enables the Dirichlet-Multinomial conjugate posterior used throughout the framework.

pith-pipeline@v0.9.0 · 5839 in / 1237 out tokens · 40432 ms · 2026-05-18T10:00:47.560219+00:00 · methodology


Forward citations

Cited by 1 Pith paper

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

  1. CUDABeaver: Benchmarking LLM-Based Automated CUDA Debugging

    cs.LG 2026-05 unverdicted novelty 7.0

    CUDABeaver shows LLM CUDA debuggers often degenerate code for test-passing at the cost of speed, with protocol-aware metrics shifting success rates by up to 40 percentage points.

Reference graph

Works this paper leans on

103 extracted references · 103 canonical work pages · cited by 1 Pith paper · 38 internal anchors

  1. [1]

    Attention Is All You Need

    Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N. Gomez, Lukasz Kaiser, and Illia Polosukhin. Attention is all you need. InAdvances in Neural Information Processing Systems, 2017. URLhttps://arxiv.org/abs/1706.03762

  2. [2]

    Brown, Benjamin Mann, Nick Ryder, Melanie Subbiah, Jared Kaplan, Prafulla Dhari- wal, et al

    Tom B. Brown, Benjamin Mann, Nick Ryder, Melanie Subbiah, Jared Kaplan, Prafulla Dhari- wal, et al. Language models are few-shot learners. InAdvances in Neural Information Processing Systems, 2020. URLhttps://proceedings.neurips.cc/paper/2020/file/ 1457c0d6bfcb4967418bfb8ac142f64a-Paper.pdf

  3. [3]

    Stack Overflow Developer Survey 2025: AI and Developer Tools, 2025

    StackOverflow. Stack Overflow Developer Survey 2025: AI and Developer Tools, 2025. URLhttps:// survey.stackoverflow.co/2025/ai. Accessed: 2025-09-24

  4. [4]

    21 NVIDIA Corporation

    Nestor Maslej, Loredana Fattorini, Raymond Perrault, Yolanda Gil, Vanessa Parli, Njenga Kariuki, Emily Cap- stick, Anka Reuel, Erik Brynjolfsson, John Etchemendy, et al. Artificial intelligence index report 2025.arXiv preprint arXiv:2504.07139, 2025. URLhttps://arxiv.org/abs/2504.07139

  5. [5]

    Holistic Evaluation of Language Models

    Percy Liang, Rishi Bommasani, et al. Holistic evaluation of language models.arXiv:2211.09110, 2022. URL https://arxiv.org/abs/2211.09110

  6. [6]

    Measuring Massive Multitask Language Understanding

    Dan Hendrycks, Collin Burns, Steven Basart, et al. Measuring massive multitask language understanding. In International Conference on Learning Representations (ICLR), 2021. URLhttps://arxiv.org/abs/ 2009.03300

  7. [7]

    Beyond the Imitation Game: Quantifying and extrapolating the capabilities of language models

    Aarohi Srivastava, Abhinav Rastogi, et al. Beyond the Imitation Game: Quantifying and Extrapolating the Capabilities of Language Models (BIG-bench).arXiv:2206.04615, 2022. URLhttps://arxiv.org/abs/ 2206.04615

  8. [8]

    Scaling Laws for Neural Language Models

    Jared Kaplan, Sam McCandlish, Tom Henighan, et al. Scaling laws for neural language models. arXiv:2001.08361, 2020. URLhttps://arxiv.org/abs/2001.08361

  9. [9]

    Training Compute-Optimal Large Language Models

    Jordan Hoffmann, Sebastian Borgeaud, Arthur Mensch, et al. Training compute-optimal large language models. arXiv:2203.15556, 2022. URLhttps://arxiv.org/abs/2203.15556

  10. [10]

    Chain-of-thought prompting elicits reasoning in large language models

    Jason Wei, Xuezhi Wang, Dale Schuurmans, et al. Chain-of-thought prompting elicits reasoning in large language models. InAdvances in Neural Information Processing Systems, 2022. URLhttps://openreview.net/ pdf?id=_VjQlMeSB_J

  11. [11]

    Training language models to follow instruc- tions with human feedback

    Long Ouyang, Jeff Wu, Xu Jiang, et al. Training language models to follow instruc- tions with human feedback. InAdvances in Neural Information Processing Systems,

  12. [12]

    URLhttps://proceedings.neurips.cc/paper_files/paper/2022/file/ b1efde53be364a73914f58805a001731-Paper-Conference.pdf

  13. [13]

    LLM.int8(): 8-bit Matrix Multiplication for Transformers at Scale

    Tim Dettmers, Mike Lewis, Younes Belkada, and Luke Zettlemoyer. LLM.int8(): 8-bit Matrix Multiplication for Transformers at Scale, 2022. URLhttps://arxiv.org/abs/2208.07339

  14. [14]

    GPTQ: Accurate Post-Training Quantization for Generative Pre-trained Transformers

    Elias Frantar, Saleh Ashkboos, Torsten Hoefler, and Dan Alistarh. GPTQ: Accurate post-training quantization for generative pre-trained transformers, 2022. URLhttps://arxiv.org/abs/2210.17323

  15. [15]

    Learning both weights and connections for efficient neural networks

    Song Han, Jeff Pool, John Tran, and William Dally. Learning both weights and connections for efficient neural networks. InNeurIPS, 2015. URLhttps://papers.nips.cc/paper/ 5784-learning-both-weights-and-connections-for-efficient-neural-network

  16. [16]

    Distilling the Knowledge in a Neural Network

    Geoffrey Hinton, Oriol Vinyals, and Jeff Dean. Distilling the knowledge in a neural network, 2015. URL https://arxiv.org/abs/1503.02531

  17. [17]

    Efficient Memory Management for Large Language Model Serving with PagedAttention

    Woosuk Kwon, Zhuohan Li, Siyuan Zhuang, Ying Sheng, Lianmin Zheng, Cody Hao Yu, Joseph E. Gonzalez, Hao Zhang, and Ion Stoica. Efficient memory management for large language model serving with PagedAtten- tion. InSOSP, 2023. URLhttps://arxiv.org/abs/2309.06180. 15

  18. [18]

    Kv cache is 1 bit per channel: Efficient large language model inference with coupled quantization

    Tianyi Zhang, Jonah Yi, Zhaozhuo Xu, and Anshumali Shrivastava. Kv cache is 1 bit per channel: Efficient large language model inference with coupled quantization. In A. Glober- son, L. Mackey, D. Belgrave, A. Fan, U. Paquet, J. Tomczak, and C. Zhang, editors,Ad- vances in Neural Information Processing Systems, volume 37, pages 3304–3331. Curran Associates...

  19. [19]

    Pqcache: Product quantization-based kvcache for long context llm inference.Proc

    Hailin Zhang, Xiaodong Ji, Yilin Chen, Fangcheng Fu, Xupeng Miao, Xiaonan Nie, Weipeng Chen, and Bin Cui. Pqcache: Product quantization-based kvcache for long context llm inference.Proc. ACM Manag. Data, 3 (3), June 2025. doi: 10.1145/3725338. URLhttps://doi.org/10.1145/3725338

  20. [20]

    LoRA: Low-Rank Adaptation of Large Language Models

    Edward J. Hu, Yelong Shen, Phillip Wallis, Zeyuan Allen-Zhu, Yuanzhi Li, Shean Wang, Lu Wang, and Weizhu Chen. LoRA: Low-Rank Adaptation of Large Language Models.arXiv preprint arXiv:2106.09685, 2021. URL https://arxiv.org/abs/2106.09685

  21. [21]

    Deep reinforcement learning from human preferences

    Paul F. Christiano, Jan Leike, Tom B. Brown, Miljan Martic, Shane Legg, and Dario Amodei. Deep reinforcement learning from human preferences. InNeurIPS, 2017. URLhttps://arxiv.org/abs/1706.03741

  22. [23]

    FlashAttention-2: Faster Attention with Better Parallelism and Work Partitioning

    Tri Dao. Flashattention-2: Faster attention with better parallelism and work partitioning.arXiv preprint arXiv:2307.08691, 2023. URLhttps://arxiv.org/abs/2307.08691

  23. [24]

    Evaluating large language models trained on code,

    Mark Chen, Jerry Tworek, Heewoo Jun, Qiming Yuan, et al. Evaluating large language models trained on code,

  24. [26]

    Mathematical Association of America, 2025

    American Invitational Mathematics Examination (AIME) — official description. Mathematical Association of America, 2025. URLhttps://maa.org/maa-invitational-competitions/. 15 questions, 3 hours

  25. [27]

    A sober look at progress in language model reasoning: Pitfalls and paths to repro- ducibility.arXiv preprint arXiv:2504.07086,

    Andreas Hochlehnert, Hardik Bhatnagar, Vishaal Udandarao, Samuel Albanie, Ameya Prabhu, and Matthias Bethge. A sober look at progress in language model reasoning: Pitfalls and paths to reproducibility.arXiv preprint arXiv:2504.07086, 2025. URLhttps://arxiv.org/abs/2504.07086

  26. [29]

    The Curious Case of Neural Text Degeneration

    Ari Holtzman, Jan Buys, Li Du, Maxwell Forbes, and Yejin Choi. The curious case of neural text degeneration. InICLR, 2020. URLhttps://openreview.net/forum?id=rygGQyrFvH. arXiv:1904.09751 (2019)

  27. [30]

    The Hitchhiker’s Guide to Testing Statistical Significance in NLP

    Rotem Dror, Gili Baumer, Segev Shlomov, and Roi Reichart. The Hitchhiker’s Guide to Testing Statistical Significance in NLP. InACL, pages 1383–1392, 2018. URLhttps://aclanthology.org/P18-1128/

  28. [31]

    More accurate tests for the statistical significance of result differences

    Alexander Yeh. More accurate tests for the statistical significance of result differences. InCOLING, 2000. URL https://aclanthology.org/C00-2137/

  29. [32]

    Jesse Dodge, Suchin Gururangan, Dallas Card, Roy Schwartz, and Noah A. Smith. Show your work: Im- proved reporting of experimental results. InEMNLP-IJCNLP, 2019. URLhttps://aclanthology.org/ D19-1224/

  30. [33]

    Judging LLM-as-a-Judge with MT-Bench and Chatbot Arena

    Lianmin Zheng, Wei-Lin Chiang, Yingbo Sheng, and et al. Judging LLM-as-a-Judge with MT-Bench and Chatbot Arena.arXiv preprint arXiv:2306.05685, 2023. URLhttps://arxiv.org/abs/2306.05685

  31. [34]

    Humans or LLMs as the Judge? A Study on Judgement Bias

    Guande Chen, Kai Shen, Saurav Shah, and et al. Humans or LLMs as the Judge? A Study on Judgement Bias. InEMNLP, 2024. URLhttps://aclanthology.org/2024.emnlp-main.474.pdf

  32. [35]

    Confidence in large language model evaluation: A bayesian approach to limited-sample challenges, 2025

    Xiao Xiao, Yu Su, Sijing Zhang, Zhang Chen, Yadong Chen, and Tian Liu. Confidence in large language model evaluation: A bayesian approach to limited-sample challenges, 2025. URLhttps://arxiv.org/abs/ 2504.21303

  33. [36]

    Straightforward bayesian a/b testing with dirichlet posteriors.arXiv preprint arXiv:2508.08077, 2025

    Dustin Hayden and Thomas Armitage. Straightforward bayesian a/b testing with dirichlet posteriors.arXiv preprint arXiv:2508.08077, 2025. URLhttps://arxiv.org/abs/2508.08077. 16

  34. [37]

    American invitational mathematics examination (aime).https:// maa.org/maa-invitational-competitions/, 2024

    Mathematical Association of America. American invitational mathematics examination (aime).https:// maa.org/maa-invitational-competitions/, 2024. Official MAA page for the AIME competition (covers AIME 2024)

  35. [38]

    American invitational mathematics examination (aime).https:// maa.org/maa-invitational-competitions/, 2025

    Mathematical Association of America. American invitational mathematics examination (aime).https:// maa.org/maa-invitational-competitions/, 2025. Official MAA page for the AIME competition (covers AIME 2025)

  36. [39]

    Hmmt february 2025 archive (problems and solutions).https:// www.hmmt.org/www/archive/282, 2025

    Harvard–MIT Mathematics Tournament. Hmmt february 2025 archive (problems and solutions).https:// www.hmmt.org/www/archive/282, 2025. Official HMMT archive page for February 2025 competition

  37. [40]

    Brown university math olympiad (brumo).https://www

    Brown University Math Olympiad Organizers. Brown university math olympiad (brumo).https://www. brumo.org/tournament-info, 2025. Official BrUMO website with tournament information (Apr 4–5, 2025)

  38. [41]

    Leveraging LLM Inconsistency to Boost Pass@ k Performance.arXiv preprint arXiv:2505.12938, 2025

    Uri Dalal, Meirav Segal, Zvika Ben-Haim, Dan Lahav, and Omer Nevo. Leveraging LLM Inconsistency to Boost Pass@ k Performance.arXiv preprint arXiv:2505.12938, 2025. URLhttps://arxiv.org/abs/2505. 12938

  39. [42]

    Evaluating Large Language Models Trained on Code

    Mark Chen, Jerry Tworek, Heewoo Jun, Qiming Yuan, Henrique Ponde De Oliveira Pinto, Jared Kaplan, Harri Edwards, Yuri Burda, Nicholas Joseph, Greg Brockman, et al. Evaluating large language models trained on code. arXiv preprint arXiv:2107.03374, 2021. URLhttps://arxiv.org/abs/2107.03374

  40. [43]

    Textual Bayes: Quantifying Prompt Uncertainty in LLM-Based Systems

    Brendan Leigh Ross, Noel V ouitsis, Atiyeh Ashari Ghomi, Rasa Hosseinzadeh, Ji Xin, Zhaoyan Liu, Yi Sui, Shiyi Hou, Kin Kwan Leung, Gabriel Loaiza-Ganem, et al. Textual Bayes: Quantifying Uncertainty in LLM- Based Systems.arXiv preprint arXiv:2506.10060, 2025. URLhttps://arxiv.org/abs/2506.10060

  41. [44]

    Uncertainty Quantification for LLMs through Minimum Bayes Risk: Bridging Confidence and Consistency.arXiv preprint arXiv:2502.04964, 2025

    Roman Vashurin, Maiya Goloburda, Albina Ilina, Aleksandr Rubashevskii, Preslav Nakov, Artem Shelmanov, and Maxim Panov. Uncertainty Quantification for LLMs through Minimum Bayes Risk: Bridging Confidence and Consistency.arXiv preprint arXiv:2502.04964, 2025. URLhttps://arxiv.org/abs/2502.04964

  42. [45]

    Cambridge university press, 2003

    Edwin T Jaynes.Probability theory: The logic of science. Cambridge university press, 2003

  43. [46]

    Position: Don’t Use the CLT in LLM Evals With Fewer Than a Few Hundred Datapoints.arXiv preprint arXiv:2503.01747, 2025

    Sam Bowyer, Laurence Aitchison, and Desi R Ivanova. Position: Don’t Use the CLT in LLM Evals With Fewer Than a Few Hundred Datapoints.arXiv preprint arXiv:2503.01747, 2025. URLhttps://arxiv.org/ abs/2503.01747

  44. [47]

    Do NOT Think That Much for 2+3=? On the Overthinking of o1-Like LLMs

    Xingyu Chen, Jiahao Xu, Tian Liang, Zhiwei He, Jianhui Pang, Dian Yu, Linfeng Song, Qiuzhi Liu, Mengfei Zhou, Zhuosheng Zhang, et al. Do not think that much for 2+ 3=? on the overthinking of o1-like llms.arXiv preprint arXiv:2412.21187, 2024. URLhttps://arxiv.org/abs/2412.21187

  45. [48]

    Wong, Songyang Zhang, and Kai Chen

    Shudong Liu, Hongwei Liu, Junnan Liu, Linchen Xiao, Songyang Gao, Chengqi Lyu, Yuzhe Gu, Wenwei Zhang, Derek F. Wong, Songyang Zhang, and Kai Chen. Compassverifier: A unified and robust verifier for llms evalua- tion and outcome reward, 2025. URLhttps://arxiv.org/abs/2508.03686

  46. [49]

    DeepSeek-R1: Incentivizing Reasoning Capability in LLMs via Reinforcement Learning

    Daya Guo, Dejian Yang, Haowei Zhang, Junxiao Song, Ruoyu Zhang, Runxin Xu, Qihao Zhu, Shirong Ma, Peiyi Wang, Xiao Bi, et al. Deepseek-r1: Incentivizing reasoning capability in llms via reinforcement learning.arXiv preprint arXiv:2501.12948, 2025. URLhttps://arxiv.org/abs/2501.12948

  47. [50]

    DeepSeekMath: Pushing the Limits of Mathematical Reasoning in Open Language Models

    Zhihong Shao, Peiyi Wang, Qihao Zhu, Runxin Xu, Junxiao Song, Xiao Bi, Haowei Zhang, Mingchuan Zhang, YK Li, Yang Wu, et al. Deepseekmath: Pushing the limits of mathematical reasoning in open language models. arXiv preprint arXiv:2402.03300, 2024. URLhttps://arxiv.org/abs/2402.03300

  48. [51]

    Dart-math: Difficulty-aware rejection tuning for mathematical problem-solving

    Yuxuan Tong, Xiwen Zhang, Rui Wang, Ruidong Wu, and Junxian He. Dart-math: Difficulty-aware rejection tuning for mathematical problem-solving. In A. Globerson, L. Mackey, D. Belgrave, A. Fan, U. Paquet, J. Tom- czak, and C. Zhang, editors,Advances in Neural Information Processing Systems, volume 37, pages 7821–7846. Curran Associates, Inc., 2024. URLhttps...

  49. [52]

    TinyGSM: Achieving >80% on GSM8K with small language models

    Bingbin Liu, Sebastien Bubeck, Ronen Eldan, Janardhan Kulkarni, Yuanzhi Li, Anh Nguyen, Rachel Ward, and Yi Zhang. TinyGSM: Achieving >80% on GSM8K with small language models, 2023. URL https://arxiv.org/abs/2312.09241

  50. [53]

    Self-explore: Enhancing mathematical reasoning in large language models by finding the first pit

    Hyeongdon Hwang et al. Self-explore: Enhancing mathematical reasoning in large language models by finding the first pit. In Findings of EMNLP, 2024. URL https://aclanthology.org/2024.findings-emnlp.78/

  51. [54]

    Weak-to-strong reasoning

    Yan Yang et al. Weak-to-strong reasoning. In Findings of EMNLP, 2024. URL https://aclanthology.org/2024.findings-emnlp.490/

  52. [55]

    s1: Simple test-time scaling

    Niklas Muennighoff, Zitong Yang, Weijia Shi, Xiang Lisa Li, Li Fei-Fei, Hannaneh Hajishirzi, Luke Zettlemoyer, Percy Liang, Emmanuel Candès, and Tatsunori Hashimoto. s1: Simple test-time scaling, 2025. URL https://arxiv.org/abs/2501.19393

  53. [56]

    Rethinking fine-tuning when scaling test-time compute: Limiting confidence improves mathematical reasoning

    Feng Chen, Allan Raventos, Nan Cheng, Surya Ganguli, and Shaul Druckmann. Rethinking fine-tuning when scaling test-time compute: Limiting confidence improves mathematical reasoning, 2025. URL https://arxiv.org/abs/2502.07154

  54. [57]

    Are your llms capable of stable reasoning?

    Junnan Liu, Hongwei Liu, Linchen Xiao, Ziyi Wang, Kuikun Liu, Songyang Gao, Wenwei Zhang, Songyang Zhang, and Kai Chen. Are your llms capable of stable reasoning?, 2025. URL https://arxiv.org/abs/2412.13147

  55. [58]

    Exaone deep: Reasoning enhanced language models, 2025

    LG AI Research, Kyunghoon Bae, Eunbi Choi, Kibong Choi, Stanley Jungkyu Choi, Yemuk Choi, Seokhee Hong, Junwon Hwang, Hyojin Jeon, Kijeong Jeon, Gerrard Jeongwon Jo, Hyunjik Jo, Jiyeon Jung, Hyosang Kim, Joonkee Kim, Seonghwan Kim, Soyeon Kim, Sunkyoung Kim, Yireun Kim, Yongil Kim, Youchul Kim, Edward Hwayoung Lee, Haeju Lee, Honglak Lee, Jinsik Lee, Kyun...

  56. [59]

    Effective red-teaming of policy-adherent agents

    Itay Nakash, George Kour, Koren Lazar, Matan Vetzler, Guy Uziel, and Ateret Anaby-Tavor. Effective red-teaming of policy-adherent agents, 2025. URL https://arxiv.org/abs/2506.09600

  57. [60]

    Trojanpuzzle: Covertly poisoning code-suggestion models

    Hojjat Aghakhani, Wei Dai, Andre Manoel, Xavier Fernandes, Anant Kharkar, Christopher Kruegel, Giovanni Vigna, David Evans, Ben Zorn, and Robert Sim. Trojanpuzzle: Covertly poisoning code-suggestion models. URL https://arxiv.org/abs/2301.02344

  59. [62]

    LoRATK: LoRA Once, Backdoor Everywhere in the Share-and-Play Ecosystem

    Hongyi Liu, Shaochen Zhong, Xintong Sun, Minghao Tian, Mohsen Hariri, Zirui Liu, Ruixiang Tang, Zhimeng Jiang, Jiayi Yuan, Yu-Neng Chuang, et al. LoRATK: LoRA Once, Backdoor Everywhere in the Share-and-Play Ecosystem. arXiv preprint arXiv:2403.00108, 2024. URL https://arxiv.org/abs/2403.00108

  60. [63]

    An llm-assisted easy-to-trigger backdoor attack on code completion models: Injecting disguised vulnerabilities against strong detection

    Shenao Yan, Shen Wang, Yue Duan, Hanbin Hong, Kiho Lee, Doowon Kim, and Yuan Hong. An llm-assisted easy-to-trigger backdoor attack on code completion models: Injecting disguised vulnerabilities against strong detection, 2024. URL https://arxiv.org/abs/2406.06822

  61. [64]

    Rtl-breaker: Assessing the security of llms against backdoor attacks on hdl code generation

    Lakshmi Likhitha Mankali, Jitendra Bhandari, Manaar Alam, Ramesh Karri, Michail Maniatakos, Ozgur Sinanoglu, and Johann Knechtel. Rtl-breaker: Assessing the security of llms against backdoor attacks on hdl code generation, 2024. URL https://arxiv.org/abs/2411.17569

  62. [65]

    How do large language monkeys get their power (laws)?

    Rylan Schaeffer, Joshua Kazdan, John Hughes, Jordan Juravsky, Sara Price, Aengus Lynch, Erik Jones, Robert Kirk, Azalia Mirhoseini, and Sanmi Koyejo. How do large language monkeys get their power (laws)?, 2025. URL https://arxiv.org/abs/2502.17578

  63. [66]

    $\tau$-bench: A Benchmark for Tool-Agent-User Interaction in Real-World Domains

    Shunyu Yao, Noah Shinn, Pedram Razavi, and Karthik Narasimhan. τ-bench: A benchmark for tool-agent-user interaction in real-world domains. arXiv preprint, 2024. doi: 10.48550/arXiv.2406.12045. URL https://doi.org/10.48550/arXiv.2406.12045. Introduces the pass^k metric

  64. [67]

    Are your llms capable of stable reasoning?

    Junnan Liu, Hongwei Liu, Linchen Xiao, Ziyi Wang, Kuikun Liu, Songyang Gao, Wenwei Zhang, Songyang Zhang, and Kai Chen. Are your llms capable of stable reasoning? In Findings of ACL, 2025. URL https://aclanthology.org/2025.findings-acl.905/. Camera-ready version detailing G-Pass@k_τ and mG-Pass

  65. [68]

    Lessons from the Trenches on Reproducible Evaluation of Language Models

    Stella Biderman, Hailey Schoelkopf, Lintang Sutawika, Leo Gao, Jonathan Tow, Baber Abbasi, Alham Fikri Aji, Pawan Sasanka Ammanamanchi, Sidney Black, Jordan Clive, Anthony DiPofi, Julen Etxaniz, Benjamin Fattori, Jessica Zosa Forde, Charles Foster, Jeffrey Hsu, Mimansa Jaiswal, Wilson Y. Lee, Haonan Li, Charles Lovering, Niklas Muennighoff, Ellie Pavlick...

  66. [70]

    URL https://arxiv.org/abs/2505.21972

  67. [71]

    Prediction-powered inference

    Anastasios N Angelopoulos, Stephen Bates, Clara Fannjiang, Michael I Jordan, and Tijana Zrnic. Prediction-powered inference. Science, 382(6671):669–674, 2023

  68. [72]

    Reliable confidence intervals for information retrieval evaluation using generative ai

    Harrie Oosterhuis, Rolf Jagerman, Zhen Qin, Xuanhui Wang, and Michael Bendersky. Reliable confidence intervals for information retrieval evaluation using generative ai. In Proceedings of the 30th ACM SIGKDD Conference on Knowledge Discovery and Data Mining, pages 2307–2317, 2024

  69. [73]

    HuggingFace's Transformers: State-of-the-art Natural Language Processing

    Thomas Wolf, Lysandre Debut, Victor Sanh, Julien Chaumond, Clement Delangue, Anthony Moi, Pierric Cistac, Tim Rault, Rémi Louf, Morgan Funtowicz, et al. Huggingface’s transformers: State-of-the-art natural language processing. arXiv preprint arXiv:1910.03771, 2019. URL https://arxiv.org/abs/1910.03771

  70. [74]

    Efficient memory management for large language model serving with pagedattention

    Woosuk Kwon, Zhuohan Li, Siyuan Zhuang, Ying Sheng, Lianmin Zheng, Cody Hao Yu, Joseph Gonzalez, Hao Zhang, and Ion Stoica. Efficient memory management for large language model serving with pagedattention. In Proceedings of the 29th symposium on operating systems principles, pages 611–626, 2023

  71. [75]

    Large Language Monkeys: Scaling Inference Compute with Repeated Sampling

    Bradley Brown, Jordan Juravsky, Ryan Ehrlich, Ronald Clark, Quoc V Le, Christopher Ré, and Azalia Mirhoseini. Large language monkeys: Scaling inference compute with repeated sampling. arXiv preprint arXiv:2407.21787, 2024. URL https://arxiv.org/abs/2407.21787

  72. [76]

    Bleu: a method for automatic evaluation of machine translation (doi:10.3115/1073083.1073135)

    Kishore Papineni, Salim Roukos, Todd Ward, and Wei-Jing Zhu. Bleu: a method for automatic evaluation of machine translation. In Pierre Isabelle, Eugene Charniak, and Dekang Lin, editors, Proceedings of the 40th Annual Meeting of the Association for Computational Linguistics, pages 311–318, Philadelphia, Pennsylvania, USA, July 2002. Association for Computa...

  73. [77]

    CodeBLEU: a Method for Automatic Evaluation of Code Synthesis

    Shuo Ren, Daya Guo, Shuai Lu, Long Zhou, Shujie Liu, Duyu Tang, Neel Sundaresan, Ming Zhou, Ambrosio Blanco, and Shuai Ma. Codebleu: a method for automatic evaluation of code synthesis. arXiv preprint arXiv:2009.10297, 2020. URL https://arxiv.org/abs/2009.10297

  74. [78]

    SPoC: Search-based Pseudocode to Code

    Sumith Kulal, Panupong Pasupat, Kartik Chandra, Mina Lee, Oded Padon, Alex Aiken, and Percy S Liang. Spoc: Search-based pseudocode to code. Advances in Neural Information Processing Systems, 32, 2019. URL https://arxiv.org/abs/1906.04908

  75. [80]

    URL https://arxiv.org/abs/2103.03874

  76. [81]

    Training Verifiers to Solve Math Word Problems

    Karl Cobbe, Vineet Kosaraju, Mohammad Bavarian, Mark Chen, Heewoo Jun, Lukasz Kaiser, Matthias Plappert, Jerry Tworek, Jacob Hilton, Reiichiro Nakano, et al. Training verifiers to solve math word problems. arXiv preprint arXiv:2110.14168, 2021. URL https://arxiv.org/abs/2110.14168

  77. [82]

    Self-Consistency Improves Chain of Thought Reasoning in Language Models

    Xuezhi Wang, Jason Wei, Dale Schuurmans, Quoc Le, Ed Chi, Sharan Narang, Aakanksha Chowdhery, and Denny Zhou. Self-consistency improves chain of thought reasoning in language models. arXiv preprint arXiv:2203.11171, 2022. URL https://arxiv.org/abs/2203.11171

  78. [83]

    Solving quantitative reasoning problems with language models

    Aitor Lewkowycz, Anders Andreassen, David Dohan, Ethan Dyer, Henryk Michalewski, Vinay Ramasesh, Ambrose Slone, Cem Anil, Imanol Schlag, Theo Gutman-Solo, et al. Solving quantitative reasoning problems with language models. Advances in Neural Information Processing Systems, 35:3843–3857, 2022

  79. [84]

    M. G. Kendall. A new measure of rank correlation. Biometrika, 30(1-2):81–93, 06 1938. ISSN 0006-3444. doi: 10.1093/biomet/30.1-2.81. URL https://doi.org/10.1093/biomet/30.1-2.81

  80. [85]

    Think less, achieve more: Cut reasoning costs by 50%

    NovaSky Team. Think less, achieve more: Cut reasoning costs by 50%. URL https://novasky-ai.github.io/posts/reduce-overthinking. Accessed: 2025-01-23
