Don't Pass@k: A Bayesian Framework for Large Language Model Evaluation

Amirhossein Samandar; Michael Hinczewski; Mohsen Hariri; Vipin Chaudhary

arXiv: 2510.04265 · v4 · submitted 2025-10-05 · cs.AI · cs.CL · math.ST · stat.ML · stat.TH

Pith reviewed 2026-05-18 10:00 UTC · model grok-4.3

keywords: Bayesian evaluation, LLM benchmarking, Pass@k, Dirichlet prior, rank stability, credible intervals, model ranking

The pith

Bayesian posterior estimates of success probability replace Pass@k to yield stable LLM rankings with explicit uncertainty at small sample sizes.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper contends that Pass@k and similar metrics produce unstable rankings for large language models when sample sizes are small due to high variance. It develops a Bayesian framework that treats each trial outcome as a categorical variable drawn from an unknown probability distribution equipped with a Dirichlet prior. Closed-form updates then give the posterior mean for the success probability and credible intervals that quantify uncertainty. This leads to faster convergence to the correct model ordering and more stable ranks than Pass@k in simulations with known ground truth as well as on recent math competition benchmarks. The approach also supplies a direct way to determine whether differences between models are meaningful and extends straightforwardly to rubric-based scoring beyond simple binary correctness.

Core claim

Evaluation outcomes are modeled as categorical with a Dirichlet prior, giving closed-form expressions for the posterior mean and uncertainty of any weighted rubric. Theoretically, under a uniform prior, the Bayesian posterior mean is order-equivalent to average accuracy (Pass@1). Empirically, the posterior-based procedure achieves faster convergence and greater rank stability than Pass@k and recent variants on simulations with known ground-truth success rates and on AIME'24/'25, HMMT'25, and BrUMO'25, enabling reliable comparisons at far smaller sample counts. The framework clarifies when observed gaps are statistically meaningful via non-overlapping credible intervals and naturally extends to graded, rubric-based evaluations.

What carries the argument

The Dirichlet-Multinomial posterior distribution over categorical success probabilities, which supplies closed-form expressions for the mean and credible intervals of any weighted rubric under the Bayesian model.
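
As a rough illustration of the closed-form summary described above, the sketch below applies the standard Dirichlet conjugate update to one vector of category counts and returns the posterior mean and standard deviation of a weighted rubric score. The function name `dirichlet_rubric_summary`, the example counts, and the rubric weights are assumptions made for illustration; this is not the paper's code or the API of its released package.

```python
import math

def dirichlet_rubric_summary(counts, weights, prior=None):
    """Posterior mean and std of sum_k weights[k] * p[k] under a Dirichlet posterior.

    counts[k]  : observed number of trials that fell in category k
    weights[k] : rubric score assigned to category k
    prior[k]   : Dirichlet concentration parameters (defaults to uniform, all ones)
    """
    if prior is None:
        prior = [1.0] * len(counts)  # uniform prior (assumption of this sketch)
    if not (len(counts) == len(weights) == len(prior)):
        raise ValueError("counts, weights and prior must have the same length")

    # Conjugate update: posterior concentration = prior concentration + counts.
    alpha = [a0 + n for a0, n in zip(prior, counts)]
    total = sum(alpha)
    probs = [a / total for a in alpha]  # posterior mean of each category probability

    mean = sum(w * m for w, m in zip(weights, probs))
    second_moment = sum(w * w * m for w, m in zip(weights, probs))
    # Variance of a linear functional of a Dirichlet-distributed vector.
    variance = (second_moment - mean ** 2) / (total + 1.0)
    return mean, math.sqrt(max(variance, 0.0))

# Binary case: 7 correct out of 10 trials, categories ordered as [wrong, correct].
mean, std = dirichlet_rubric_summary(counts=[3, 7], weights=[0.0, 1.0])
print(f"binary: mean={mean:.4f}, std={std:.4f}")

# Hypothetical three-level rubric: wrong = 0, partially correct = 0.5, correct = 1.
mean3, std3 = dirichlet_rubric_summary(counts=[2, 3, 5], weights=[0.0, 0.5, 1.0])
print(f"rubric: mean={mean3:.4f}, std={std3:.4f}")
```

In the binary case the result reduces to the familiar Beta posterior, which gives a quick sanity check on the general formula.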

If this is right

Reliable model comparisons become possible with far smaller numbers of samples than currently required by Pass@k.
Non-overlapping credible intervals serve as a transparent rule for declaring performance differences statistically meaningful.
The same framework applies directly to both binary correctness and graded or rubric-scored evaluations.
Prior evidence from previous evaluations can be incorporated through the choice of Dirichlet parameters.
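
The interval-based decision rule and the use of prior evidence listed above can be sketched for the binary case as follows. The helper names, the tallies, and the pseudo-counts standing in for an earlier evaluation are all hypothetical, and the intervals are estimated by Monte Carlo sampling from the Beta posterior rather than by any method the paper specifies.

```python
import random

def beta_credible_interval(successes, trials, prior=(1.0, 1.0),
                           level=0.95, draws=20000, seed=0):
    """Equal-tailed credible interval for a success probability, by Monte Carlo."""
    rng = random.Random(seed)
    a = prior[0] + successes             # posterior pseudo-count of successes
    b = prior[1] + (trials - successes)  # posterior pseudo-count of failures
    samples = sorted(rng.betavariate(a, b) for _ in range(draws))
    lo_idx = int((1.0 - level) / 2.0 * draws)
    hi_idx = min(draws - 1, int((1.0 + level) / 2.0 * draws))
    return samples[lo_idx], samples[hi_idx]

def intervals_overlap(ci_a, ci_b):
    """True when the two closed intervals share at least one point."""
    return ci_a[0] <= ci_b[1] and ci_b[0] <= ci_a[1]

# Hypothetical tallies out of 80 trials each.
ci_strong = beta_credible_interval(70, 80)
ci_weak = beta_credible_interval(30, 80)
ci_close_a = beta_credible_interval(42, 80)
ci_close_b = beta_credible_interval(40, 80, seed=1)
print("clear gap, intervals overlap:", intervals_overlap(ci_strong, ci_weak))
print("small gap, intervals overlap:", intervals_overlap(ci_close_a, ci_close_b))

# Prior evidence enters as extra pseudo-counts: here a hypothetical earlier run
# with 45 successes and 15 failures is added to the uniform prior.
ci_uniform = beta_credible_interval(6, 10)
ci_informed = beta_credible_interval(6, 10, prior=(1.0 + 45, 1.0 + 15))
print("uniform prior interval: ", ci_uniform)
print("informed prior interval:", ci_informed)
```

With only 10 new trials, the informed prior dominates the posterior, which is the intended effect when earlier evidence is trusted and a reason to choose the prior parameters with care when it is not.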

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the authors make directly.

This protocol could lower the computational cost of large-scale LLM benchmarking by reducing the number of required model calls per evaluation.
Hierarchical Bayesian extensions might further improve estimates by sharing statistical strength across related tasks or model families.
The same treatment of stochastic outcomes could be applied to evaluation in other domains such as reinforcement learning or automated theorem proving.

Load-bearing premise

Each model's performance on a given task can be summarized by a single fixed but unknown success probability from which trials are independent draws.

What would settle it

A simulation or benchmark experiment with known ground-truth success rates in which the posterior-based procedure does not converge faster or produce more stable ranks than Pass@k at the same sample counts.
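
A minimal version of such an experiment might look like the sketch below. It simulates a handful of models with fixed per-question success probabilities, ranks them by average accuracy (used here as a stand-in for Bayes@N, since the source states that the uniform-prior posterior mean is order-equivalent to it) and by the standard unbiased Pass@k estimate, then reports the mean Kendall τ of each ranking against the ground truth. The model means, the Beta spread of question difficulty, the sample sizes, and the simplified tau-a statistic are all assumptions of this sketch, not the paper's experimental setup.

```python
import math
import random

def pass_at_k(n, c, k):
    """Unbiased Pass@k estimate from c successes in n trials (requires k <= n)."""
    return 1.0 - math.comb(n - c, k) / math.comb(n, k)

def kendall_tau(x, y):
    """Kendall tau-a: (concordant - discordant) / number of pairs; ties add zero."""
    n = len(x)
    score = 0
    for i in range(n):
        for j in range(i + 1, n):
            s = (x[i] - x[j]) * (y[i] - y[j])
            score += (s > 0) - (s < 0)
    return score / (n * (n - 1) / 2)

def simulate(model_means, n_questions=30, n_trials=8, k=4, reps=200, seed=0):
    """Mean Kendall tau vs. ground truth for average accuracy and for Pass@k."""
    rng = random.Random(seed)
    # Fixed per-question success probabilities, scattered around each model's mean.
    models = [
        [rng.betavariate(2.0 * m, 2.0 * (1.0 - m)) for _ in range(n_questions)]
        for m in model_means
    ]
    truth = [sum(ps) / n_questions for ps in models]

    tau_avg = tau_pass = 0.0
    for _ in range(reps):
        avg_scores, pass_scores = [], []
        for ps in models:
            # Number of successes per question over n_trials independent draws.
            succ = [sum(rng.random() < p for _ in range(n_trials)) for p in ps]
            avg_scores.append(sum(succ) / (n_questions * n_trials))
            pass_scores.append(sum(pass_at_k(n_trials, c, k) for c in succ) / n_questions)
        tau_avg += kendall_tau(truth, avg_scores)
        tau_pass += kendall_tau(truth, pass_scores)
    return tau_avg / reps, tau_pass / reps

# Hypothetical ground-truth means for six simulated models.
tau_avg, tau_pass = simulate([0.25, 0.35, 0.45, 0.55, 0.65, 0.75])
print(f"mean tau, average accuracy: {tau_avg:.3f}")
print(f"mean tau, Pass@4:           {tau_pass:.3f}")
```

Varying `n_trials` and `k` shows how each ranking approaches the ground truth as samples are added; the claim would be in trouble if Pass@k matched or beat the posterior-based ranking across such settings.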

Figures

Figures reproduced from arXiv: 2510.04265 by Amirhossein Samandar, Michael Hinczewski, Mohsen Hariri, Vipin Chaudhary.

**Figure 2.** Kendall’s τ rank correlation for various evaluation methods compared to the true ranking of 11 sets of biased coins (LLM mimics) with known mean success probabilities $\bar{\pi}$ = 0.2332, 0.2545, 0.3604, 0.3642, 0.3642, 0.4466, 0.5418, 0.5276, 0.608, 0.6213, 0.7327. The simulation evaluates methods including Pass@k (k = 2, 4, 8), Bayes@N, naive Pass^k, G-Pass@$k_{\tilde{\tau}}$ ($\tilde{\tau}$ = 0.5), and mG-Pass@k across 1 to 80 trials. P…

**Figure 3.** (a) Histogram of Kendall τ values comparing the original ranking of synthetic LLM models and 50k replicates of updated models. (b) Mean Kendall τ between the estimated and true ranking for the updated models (50k replicates) as a function of N, the number of trials. The dashed line corresponds to estimates using Bayes@N with a uniform prior (D = 0), while the solid lines are Bayes@N with a non-uniform prior and d…

**Figure 4.** (a) Probability of correctly ranking LLM10 above LLM9 using Bayes@N in the biased-coin simulations, shown as a function of trial count N. The probability is 83.7% at N = 80, increases to ∼94.7% at N = 199, and reaches 96.9% at N = 285. (b) Corresponding absolute z-scores as a function of N, with values of ∼1.14 at N = 80, 1.645 at N = 199 (95% confidence), and 1.96 at N = 285 (97.5% confidence).

**Figure 5.** Average Kendall’s τ correlation between rankings produced by various evaluation methods and the gold standard (derived from Bayes@80, or equivalently avg@80), as a function of the number of trials N. Results are averaged over $10^4$ bootstrapped resamples for each dataset: (a) AIME’25, (b) AIME’24, (c) HMMT’25, and (d) BrUMO’25. Methods include Bayesian estimation Bayes@N, Pass@k (k = 2, 4, 8), naive Pass^k,…

**Figure 6.** Worst-case rank trajectories. Each colored line tracks a model’s rank as trials are added (x-axis), across $10^5$ bootstrap replications. Convergence is the minimal N after which the ranking remains unchanged. Top row (11 models): AIME’24 (N=75), AIME’25 (no convergence within 80), HMMT’25 (N=78), and BrUMO’25 (N=68). Bottom row (20 models): each benchmark has at least one no-convergence replicate within 80 t…

**Figure 7.** Convergence@n without CI. Mean convergence@n across model combinations for AIME’24, AIME’25, HMMT’25, and BrUMO’25. Top: 50 combinations of 5 models. Bottom-left: 20 combinations of 10 models. Bottom-right: 20 combinations of 15 models. Color indicates the mean convergence@n over $10^5$ bootstrap replicates (green: fast convergence; red: slow convergence).

**Figure 8.** Sensitivity of model rankings to the categorical scoring schema. For each schema variant (x-axis; see …

**Figure 9.** Computational cost analysis. (Left) Total inference time in hours aggregated over 80 trials and 30 questions per benchmark (2,400 inference runs per cell). (Right) Total number of completion tokens (in thousands) generated across the same runs. Models are ordered by overall performance (best to worst, top to bottom). Task-level computational cost. HMMT’25 is the most expensive benchmark in terms of GPU tim…

**Figure 10.** CDF of convergence@n. Complementing the PMFs in …

Original abstract

Pass$@k$ is widely used to report the reasoning performance of LLMs, but it often produces unstable and potentially misleading rankings, especially when the number of trials (samples) is limited and computational resources are constrained. We present a principled Bayesian evaluation framework that replaces Pass$@k$ and average accuracy over $N$ trials (avg$@N$) with posterior estimates of a model's underlying success probability and credible intervals, yielding stable rankings and a transparent decision rule for differences. Evaluation outcomes are modeled as categorical (not just 0/1) with a Dirichlet prior, giving closed-form expressions for the posterior mean and uncertainty of any weighted rubric and enabling the use of prior evidence when appropriate. Theoretically, under a uniform prior, the Bayesian posterior mean is order-equivalent to average accuracy (Pass$@1$), explaining its empirical robustness while adding principled uncertainty. Empirically, in simulations with known ground-truth success rates and on AIME'24/'25, HMMT'25, and BrUMO'25, the posterior-based procedure achieves faster convergence and greater rank stability than Pass$@k$ and recent variants, enabling reliable comparisons at far smaller sample counts. The framework clarifies when observed gaps are statistically meaningful (non-overlapping credible intervals) versus noise, and it naturally extends to graded, rubric-based evaluations. Together, these results recommend replacing Pass$@k$ for LLM evaluation and ranking with a posterior-based, compute-efficient protocol that unifies binary and non-binary evaluation while making uncertainty explicit. Source code is available at https://github.com/mohsenhariri/scorio

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note (private letter to a colleague)

Bayesian Dirichlet framework adds credible intervals and claims faster stable rankings than Pass@k, but the i.i.d. fixed-probability assumption is a real soft spot for heterogeneous tasks.

Colleague, the main point is a Bayesian framework that uses a Dirichlet prior to get posterior means and credible intervals for model success rates, replacing Pass@k with something that quantifies uncertainty and claims better stability at low sample counts.

The paper does a few things right. It derives closed-form expressions for the posterior from the Dirichlet-Multinomial model, which is standard but applied here to unify binary Pass@k and rubric scoring. The theoretical result that the uniform prior posterior mean is order-equivalent to average accuracy is direct and useful for understanding why it works. They test the method in simulations with known ground truth success rates and on actual recent contests like AIME'24, AIME'25, HMMT'25, and BrUMO'25, showing faster convergence and more stable ranks. Public code is a good move for adoption.

Where it is softer is the assumption that all trials for a model are i.i.d. from one fixed probability. Competition problems vary in difficulty, so the data-generating process is likely a mixture rather than a single categorical. This could make the credible intervals miscalibrated and the reported gains dependent on the simulation matching the model. Without checks for robustness to heterogeneity, the empirical claims are harder to take at face value.

The work targets people doing LLM reasoning evaluations who need reliable comparisons without burning too much compute. It is the kind of paper that deserves a serious referee because it takes a common pain point and gives a concrete, implementable fix with some supporting evidence. Recommendation: send it for peer review.

Referee Report

2 major / 1 minor

Summary. The paper proposes a Bayesian evaluation framework for LLMs that models outcomes as categorical draws under a Dirichlet prior, replacing Pass@k and avg@N with posterior means and credible intervals. It claims that under a uniform prior the posterior mean is order-equivalent to average accuracy, and that the approach yields faster convergence, greater rank stability, and clearer significance tests than Pass@k or variants on both synthetic simulations with known ground-truth rates and real math-competition benchmarks (AIME'24/'25, HMMT'25, BrUMO'25).

Significance. If the central claims hold, the framework would offer a principled, compute-efficient alternative to current LLM benchmarking practice, enabling reliable model comparisons at substantially smaller sample sizes while making uncertainty explicit and extending naturally to rubric-based scoring.

major comments (2)

[Model and Empirical Evaluation] The Dirichlet-Multinomial model (abstract and model section) assumes i.i.d. trials from a single fixed success-probability vector per model. Competition problems exhibit heterogeneous difficulty, so observed outcomes are more plausibly a mixture of Bernoullis; under this misspecification the posterior no longer correctly calibrates uncertainty and the reported gains in convergence and rank stability may be artifacts of the i.i.d. simulation regime used to generate the ground-truth comparisons.
[Empirical Evaluation] The empirical claims of faster convergence and rank stability rest on simulations generated under the same fixed-p i.i.d. regime that the model assumes (abstract). No robustness checks or alternative generative processes (e.g., difficulty-varying mixtures) are reported, leaving open whether the advantage persists on data that violate the modeling assumption.

minor comments (1)

[Abstract] The abstract states that source code is available but does not specify the exact sample sizes, number of models, or statistical tests used to quantify 'faster convergence' and 'greater rank stability' on the named benchmarks.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the detailed and constructive comments on our manuscript. We address each major comment below, clarifying the modeling assumptions, the role of simulations versus real benchmarks, and our plans for revisions.

Referee: The Dirichlet-Multinomial model (abstract and model section) assumes i.i.d. trials from a single fixed success-probability vector per model. Competition problems exhibit heterogeneous difficulty, so observed outcomes are more plausibly a mixture of Bernoullis; under this misspecification the posterior no longer correctly calibrates uncertainty and the reported gains in convergence and rank stability may be artifacts of the i.i.d. simulation regime used to generate the ground-truth comparisons.

Authors: We agree that real competition problems have heterogeneous difficulties, implying that observed successes arise from a mixture of Bernoulli distributions rather than i.i.d. draws from a single fixed success probability. Our framework is designed to estimate a model's expected success rate under the distribution of problems encountered in evaluation, with the Dirichlet prior providing regularization that improves estimate stability compared to raw Pass@k or avg@N. While the i.i.d. assumption is an approximation, the posterior mean remains order-equivalent to average accuracy under a uniform prior, and the credible intervals offer a principled way to assess differences. Importantly, the empirical results on AIME'24/'25, HMMT'25, and BrUMO'25 already reflect heterogeneous problem difficulties, and the observed improvements in rank stability and convergence there support the practical utility of the approach beyond the simulation regime. We will add a dedicated discussion subsection on modeling assumptions, potential misspecification effects on uncertainty calibration, and why the method remains useful for ranking even under heterogeneity.
Revision: partial

Referee: The empirical claims of faster convergence and rank stability rest on simulations generated under the same fixed-p i.i.d. regime that the model assumes (abstract). No robustness checks or alternative generative processes (e.g., difficulty-varying mixtures) are reported, leaving open whether the advantage persists on data that violate the modeling assumption.

Authors: We acknowledge that the primary simulation experiments use a fixed-p i.i.d. generative process to enable exact ground-truth comparisons for convergence analysis. However, the real-benchmark evaluations on AIME, HMMT, and BrUMO inherently involve varying problem difficulties and thus serve as a partial robustness check. To directly address the concern, we will add new simulation experiments in the revised manuscript that generate data from heterogeneous difficulty models (e.g., success probabilities drawn from a Beta distribution or a finite mixture of Bernoullis per problem). These will compare convergence rates, rank stability, and credible-interval coverage against Pass@k and avg@N under misspecification, allowing us to quantify whether the advantages persist.
Revision: yes
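
One way such a coverage check could be set up is sketched below. It draws per-question success probabilities from a Beta distribution, pools every trial into a single Beta posterior as if the trials were i.i.d., and counts how often a normal-approximation interval contains the realized mean success rate of that question set. The function names, the Beta concentration values, the sample sizes, the normal approximation, and the choice of the realized mean as the target are all assumptions of this sketch; a different target, such as the mean over a wider population of problems, could give a different picture.

```python
import math
import random

def pooled_interval(successes, trials, z=1.96):
    """Normal-approximation interval from a Beta(1 + s, 1 + f) posterior."""
    a, b = 1.0 + successes, 1.0 + (trials - successes)
    mean = a / (a + b)
    std = math.sqrt(a * b / ((a + b) ** 2 * (a + b + 1.0)))
    return mean - z * std, mean + z * std

def coverage(concentration, mean=0.5, n_questions=30, n_trials=8,
             reps=2000, seed=0):
    """Fraction of replicates whose pooled interval contains the realized mean rate.

    concentration=None gives the homogeneous case (every question has the same
    success probability); smaller positive values give more heterogeneous difficulty.
    """
    rng = random.Random(seed)
    hits = 0
    for _ in range(reps):
        if concentration is None:
            ps = [mean] * n_questions
        else:
            ps = [rng.betavariate(concentration * mean, concentration * (1.0 - mean))
                  for _ in range(n_questions)]
        target = sum(ps) / n_questions
        # Pool all question-by-trial outcomes as if they were i.i.d. draws.
        successes = sum(rng.random() < p for p in ps for _ in range(n_trials))
        lo, hi = pooled_interval(successes, n_questions * n_trials)
        hits += lo <= target <= hi
    return hits / reps

cov_homogeneous = coverage(concentration=None)
cov_heterogeneous = coverage(concentration=1.0)
print(f"coverage, homogeneous difficulty:   {cov_homogeneous:.3f}")
print(f"coverage, heterogeneous difficulty: {cov_heterogeneous:.3f}")
```

Comparing the two printed coverages against the nominal 95% level indicates whether, under these assumptions, pooling makes the interval too wide or too narrow.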

Circularity Check

0 steps flagged

No significant circularity detected

The paper's core derivation applies standard Dirichlet-Multinomial conjugacy to obtain closed-form posterior means and credible intervals from a uniform prior. The stated order-equivalence between the uniform-prior posterior mean and average accuracy (Pass@1) is a direct algebraic consequence of the Beta or Dirichlet update formulas and does not reduce any ranking result to a fitted parameter or self-referential definition. Empirical comparisons on simulations (with known ground-truth rates) and real benchmarks (AIME, HMMT, BrUMO) are presented as external validation rather than by construction. No load-bearing self-citations, uniqueness theorems imported from prior author work, or ansatzes smuggled via citation appear in the derivation chain. The framework remains self-contained against external statistical benchmarks.
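
The algebra behind the order-equivalence can be written out in a few lines. The notation here is assumed for this sketch rather than taken from the paper: $K$ outcome categories with rubric weights $w_k$, observed counts $n_k$ summing to $N$, and a uniform Dirichlet prior with every concentration parameter equal to 1.

```latex
\begin{align*}
\mathbb{E}[p_k \mid \text{data}] &= \frac{n_k + 1}{N + K}, \\
\mu_{\text{post}} = \sum_{k=1}^{K} w_k \, \mathbb{E}[p_k \mid \text{data}]
  &= \frac{\sum_{k} w_k n_k + \sum_{k} w_k}{N + K}
   = \frac{N}{N + K}\,\bar{w} + \frac{\sum_{k} w_k}{N + K},
\qquad \bar{w} = \frac{1}{N} \sum_{k=1}^{K} w_k n_k .
\end{align*}
```

When every model is evaluated with the same $N$, $K$, and weights, $\mu_{\text{post}}$ is an affine function of the average score $\bar{w}$ with positive slope $N/(N+K)$, so the two quantities order models identically. In the binary case with $c$ successes this reduces to $(c+1)/(N+2)$ versus $c/N$.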

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The framework rests on standard conjugate Bayesian updating for categorical data; no new free parameters are introduced beyond the choice of uniform Dirichlet prior, which is a conventional non-informative choice.

axioms (1)

Domain assumption: Evaluation outcomes are i.i.d. draws from a categorical distribution whose probability vector is fixed for a given model and task.
This modeling assumption enables the Dirichlet-Multinomial conjugate posterior used throughout the framework.

Forward citations

Cited by 1 Pith paper

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

CUDABeaver: Benchmarking LLM-Based Automated CUDA Debugging
cs.LG 2026-05 unverdicted novelty 7.0

CUDABeaver shows LLM CUDA debuggers often degenerate code for test-passing at the cost of speed, with protocol-aware metrics shifting success rates by up to 40 percentage points.

Reference graph

Works this paper leans on

103 extracted references · 103 canonical work pages · cited by 1 Pith paper · 38 internal anchors

[1]

Attention Is All You Need

Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N. Gomez, Lukasz Kaiser, and Illia Polosukhin. Attention is all you need. InAdvances in Neural Information Processing Systems, 2017. URLhttps://arxiv.org/abs/1706.03762

work page internal anchor Pith review Pith/arXiv arXiv 2017
[2]

Brown, Benjamin Mann, Nick Ryder, Melanie Subbiah, Jared Kaplan, Prafulla Dhari- wal, et al

Tom B. Brown, Benjamin Mann, Nick Ryder, Melanie Subbiah, Jared Kaplan, Prafulla Dhari- wal, et al. Language models are few-shot learners. InAdvances in Neural Information Processing Systems, 2020. URLhttps://proceedings.neurips.cc/paper/2020/file/ 1457c0d6bfcb4967418bfb8ac142f64a-Paper.pdf

work page 2020
[3]

Stack Overflow Developer Survey 2025: AI and Developer Tools, 2025

StackOverflow. Stack Overflow Developer Survey 2025: AI and Developer Tools, 2025. URLhttps:// survey.stackoverflow.co/2025/ai. Accessed: 2025-09-24

work page 2025
[4]

21 NVIDIA Corporation

Nestor Maslej, Loredana Fattorini, Raymond Perrault, Yolanda Gil, Vanessa Parli, Njenga Kariuki, Emily Cap- stick, Anka Reuel, Erik Brynjolfsson, John Etchemendy, et al. Artificial intelligence index report 2025.arXiv preprint arXiv:2504.07139, 2025. URLhttps://arxiv.org/abs/2504.07139

work page arXiv 2025
[5]

Holistic Evaluation of Language Models

Percy Liang, Rishi Bommasani, et al. Holistic evaluation of language models.arXiv:2211.09110, 2022. URL https://arxiv.org/abs/2211.09110

work page internal anchor Pith review Pith/arXiv arXiv 2022
[6]

Measuring Massive Multitask Language Understanding

Dan Hendrycks, Collin Burns, Steven Basart, et al. Measuring massive multitask language understanding. In International Conference on Learning Representations (ICLR), 2021. URLhttps://arxiv.org/abs/ 2009.03300

work page internal anchor Pith review Pith/arXiv arXiv 2021
[7]

Beyond the Imitation Game: Quantifying and extrapolating the capabilities of language models

Aarohi Srivastava, Abhinav Rastogi, et al. Beyond the Imitation Game: Quantifying and Extrapolating the Capabilities of Language Models (BIG-bench).arXiv:2206.04615, 2022. URLhttps://arxiv.org/abs/ 2206.04615

work page internal anchor Pith review Pith/arXiv arXiv 2022
[8]

Scaling Laws for Neural Language Models

Jared Kaplan, Sam McCandlish, Tom Henighan, et al. Scaling laws for neural language models. arXiv:2001.08361, 2020. URLhttps://arxiv.org/abs/2001.08361

work page internal anchor Pith review Pith/arXiv arXiv 2001
[9]

Training Compute-Optimal Large Language Models

Jordan Hoffmann, Sebastian Borgeaud, Arthur Mensch, et al. Training compute-optimal large language models. arXiv:2203.15556, 2022. URLhttps://arxiv.org/abs/2203.15556

work page internal anchor Pith review Pith/arXiv arXiv 2022
[10]

Chain-of-thought prompting elicits reasoning in large language models

Jason Wei, Xuezhi Wang, Dale Schuurmans, et al. Chain-of-thought prompting elicits reasoning in large language models. InAdvances in Neural Information Processing Systems, 2022. URLhttps://openreview.net/ pdf?id=_VjQlMeSB_J

work page 2022
[11]

Training language models to follow instruc- tions with human feedback

Long Ouyang, Jeff Wu, Xu Jiang, et al. Training language models to follow instruc- tions with human feedback. InAdvances in Neural Information Processing Systems,

work page
[12]

URLhttps://proceedings.neurips.cc/paper_files/paper/2022/file/ b1efde53be364a73914f58805a001731-Paper-Conference.pdf

work page 2022
[13]

LLM.int8(): 8-bit Matrix Multiplication for Transformers at Scale

Tim Dettmers, Mike Lewis, Younes Belkada, and Luke Zettlemoyer. LLM.int8(): 8-bit Matrix Multiplication for Transformers at Scale, 2022. URLhttps://arxiv.org/abs/2208.07339

work page internal anchor Pith review Pith/arXiv arXiv 2022
[14]

GPTQ: Accurate Post-Training Quantization for Generative Pre-trained Transformers

Elias Frantar, Saleh Ashkboos, Torsten Hoefler, and Dan Alistarh. GPTQ: Accurate post-training quantization for generative pre-trained transformers, 2022. URLhttps://arxiv.org/abs/2210.17323

work page internal anchor Pith review Pith/arXiv arXiv 2022
[15]

Learning both weights and connections for efficient neural networks

Song Han, Jeff Pool, John Tran, and William Dally. Learning both weights and connections for efficient neural networks. InNeurIPS, 2015. URLhttps://papers.nips.cc/paper/ 5784-learning-both-weights-and-connections-for-efficient-neural-network

work page 2015
[16]

Distilling the Knowledge in a Neural Network

Geoffrey Hinton, Oriol Vinyals, and Jeff Dean. Distilling the knowledge in a neural network, 2015. URL https://arxiv.org/abs/1503.02531

work page internal anchor Pith review Pith/arXiv arXiv 2015
[17]

Efficient Memory Management for Large Language Model Serving with PagedAttention

Woosuk Kwon, Zhuohan Li, Siyuan Zhuang, Ying Sheng, Lianmin Zheng, Cody Hao Yu, Joseph E. Gonzalez, Hao Zhang, and Ion Stoica. Efficient memory management for large language model serving with PagedAtten- tion. InSOSP, 2023. URLhttps://arxiv.org/abs/2309.06180. 15

work page internal anchor Pith review Pith/arXiv arXiv 2023
[18]

Kv cache is 1 bit per channel: Efficient large language model inference with coupled quantization

Tianyi Zhang, Jonah Yi, Zhaozhuo Xu, and Anshumali Shrivastava. Kv cache is 1 bit per channel: Efficient large language model inference with coupled quantization. In A. Glober- son, L. Mackey, D. Belgrave, A. Fan, U. Paquet, J. Tomczak, and C. Zhang, editors,Ad- vances in Neural Information Processing Systems, volume 37, pages 3304–3331. Curran Associates...

work page 2024
[19]

Pqcache: Product quantization-based kvcache for long context llm inference.Proc

Hailin Zhang, Xiaodong Ji, Yilin Chen, Fangcheng Fu, Xupeng Miao, Xiaonan Nie, Weipeng Chen, and Bin Cui. Pqcache: Product quantization-based kvcache for long context llm inference.Proc. ACM Manag. Data, 3 (3), June 2025. doi: 10.1145/3725338. URLhttps://doi.org/10.1145/3725338

work page doi:10.1145/3725338 2025
[20]

LoRA: Low-Rank Adaptation of Large Language Models

Edward J. Hu, Yelong Shen, Phillip Wallis, Zeyuan Allen-Zhu, Yuanzhi Li, Shean Wang, Lu Wang, and Weizhu Chen. LoRA: Low-Rank Adaptation of Large Language Models.arXiv preprint arXiv:2106.09685, 2021. URL https://arxiv.org/abs/2106.09685

work page internal anchor Pith review Pith/arXiv arXiv 2021
[21]

Deep reinforcement learning from human preferences

Paul F. Christiano, Jan Leike, Tom B. Brown, Miljan Martic, Shane Legg, and Dario Amodei. Deep reinforcement learning from human preferences. InNeurIPS, 2017. URLhttps://arxiv.org/abs/1706.03741

work page internal anchor Pith review Pith/arXiv arXiv 2017
[23]

FlashAttention-2: Faster Attention with Better Parallelism and Work Partitioning

Tri Dao. Flashattention-2: Faster attention with better parallelism and work partitioning.arXiv preprint arXiv:2307.08691, 2023. URLhttps://arxiv.org/abs/2307.08691

work page internal anchor Pith review Pith/arXiv arXiv 2023
[24]

Evaluating large language models trained on code,

Mark Chen, Jerry Tworek, Heewoo Jun, Qiming Yuan, et al. Evaluating large language models trained on code,

work page
[26]

Mathematical Association of America, 2025

American Invitational Mathematics Examination (AIME) — official description. Mathematical Association of America, 2025. URLhttps://maa.org/maa-invitational-competitions/. 15 questions, 3 hours

work page 2025
[27]

A sober look at progress in language model reasoning: Pitfalls and paths to repro- ducibility.arXiv preprint arXiv:2504.07086,

Andreas Hochlehnert, Hardik Bhatnagar, Vishaal Udandarao, Samuel Albanie, Ameya Prabhu, and Matthias Bethge. A sober look at progress in language model reasoning: Pitfalls and paths to reproducibility.arXiv preprint arXiv:2504.07086, 2025. URLhttps://arxiv.org/abs/2504.07086

work page arXiv 2025
[29]

The Curious Case of Neural Text Degeneration

Ari Holtzman, Jan Buys, Li Du, Maxwell Forbes, and Yejin Choi. The curious case of neural text degeneration. InICLR, 2020. URLhttps://openreview.net/forum?id=rygGQyrFvH. arXiv:1904.09751 (2019)

work page internal anchor Pith review Pith/arXiv arXiv 2020
[30]

The Hitchhiker’s Guide to Testing Statistical Significance in NLP

Rotem Dror, Gili Baumer, Segev Shlomov, and Roi Reichart. The Hitchhiker’s Guide to Testing Statistical Significance in NLP. InACL, pages 1383–1392, 2018. URLhttps://aclanthology.org/P18-1128/

work page 2018
[31]

More accurate tests for the statistical significance of result differences

Alexander Yeh. More accurate tests for the statistical significance of result differences. InCOLING, 2000. URL https://aclanthology.org/C00-2137/

work page 2000
[32]

Jesse Dodge, Suchin Gururangan, Dallas Card, Roy Schwartz, and Noah A. Smith. Show your work: Im- proved reporting of experimental results. InEMNLP-IJCNLP, 2019. URLhttps://aclanthology.org/ D19-1224/

work page 2019
[33]

Judging LLM-as-a-Judge with MT-Bench and Chatbot Arena

Lianmin Zheng, Wei-Lin Chiang, Yingbo Sheng, and et al. Judging LLM-as-a-Judge with MT-Bench and Chatbot Arena.arXiv preprint arXiv:2306.05685, 2023. URLhttps://arxiv.org/abs/2306.05685

work page internal anchor Pith review Pith/arXiv arXiv 2023
[34]

Humans or LLMs as the Judge? A Study on Judgement Bias

Guande Chen, Kai Shen, Saurav Shah, and et al. Humans or LLMs as the Judge? A Study on Judgement Bias. InEMNLP, 2024. URLhttps://aclanthology.org/2024.emnlp-main.474.pdf

work page 2024
[35]

Confidence in large language model evaluation: A bayesian approach to limited-sample challenges, 2025

Xiao Xiao, Yu Su, Sijing Zhang, Zhang Chen, Yadong Chen, and Tian Liu. Confidence in large language model evaluation: A bayesian approach to limited-sample challenges, 2025. URLhttps://arxiv.org/abs/ 2504.21303

work page arXiv 2025
[36]

Straightforward bayesian a/b testing with dirichlet posteriors.arXiv preprint arXiv:2508.08077, 2025

Dustin Hayden and Thomas Armitage. Straightforward bayesian a/b testing with dirichlet posteriors.arXiv preprint arXiv:2508.08077, 2025. URLhttps://arxiv.org/abs/2508.08077. 16

work page arXiv 2025
[37]

American invitational mathematics examination (aime).https:// maa.org/maa-invitational-competitions/, 2024

Mathematical Association of America. American invitational mathematics examination (aime).https:// maa.org/maa-invitational-competitions/, 2024. Official MAA page for the AIME competition (covers AIME 2024)

work page 2024
[38]

American invitational mathematics examination (aime).https:// maa.org/maa-invitational-competitions/, 2025

Mathematical Association of America. American invitational mathematics examination (aime).https:// maa.org/maa-invitational-competitions/, 2025. Official MAA page for the AIME competition (covers AIME 2025)

work page 2025
[39]

Hmmt february 2025 archive (problems and solutions).https:// www.hmmt.org/www/archive/282, 2025

Harvard–MIT Mathematics Tournament. Hmmt february 2025 archive (problems and solutions).https:// www.hmmt.org/www/archive/282, 2025. Official HMMT archive page for February 2025 competition

work page 2025
[40]

Brown university math olympiad (brumo).https://www

Brown University Math Olympiad Organizers. Brown university math olympiad (brumo).https://www. brumo.org/tournament-info, 2025. Official BrUMO website with tournament information (Apr 4–5, 2025)

work page 2025
[41]

Leveraging LLM Inconsistency to Boost Pass@ k Performance.arXiv preprint arXiv:2505.12938, 2025

Uri Dalal, Meirav Segal, Zvika Ben-Haim, Dan Lahav, and Omer Nevo. Leveraging LLM Inconsistency to Boost Pass@ k Performance.arXiv preprint arXiv:2505.12938, 2025. URLhttps://arxiv.org/abs/2505. 12938

work page arXiv 2025
[42]

Evaluating Large Language Models Trained on Code

Mark Chen, Jerry Tworek, Heewoo Jun, Qiming Yuan, Henrique Ponde De Oliveira Pinto, Jared Kaplan, Harri Edwards, Yuri Burda, Nicholas Joseph, Greg Brockman, et al. Evaluating large language models trained on code. arXiv preprint arXiv:2107.03374, 2021. URLhttps://arxiv.org/abs/2107.03374

work page internal anchor Pith review Pith/arXiv arXiv 2021
[43]

Textual Bayes: Quantifying Prompt Uncertainty in LLM-Based Systems

Brendan Leigh Ross, Noel V ouitsis, Atiyeh Ashari Ghomi, Rasa Hosseinzadeh, Ji Xin, Zhaoyan Liu, Yi Sui, Shiyi Hou, Kin Kwan Leung, Gabriel Loaiza-Ganem, et al. Textual Bayes: Quantifying Uncertainty in LLM- Based Systems.arXiv preprint arXiv:2506.10060, 2025. URLhttps://arxiv.org/abs/2506.10060

work page internal anchor Pith review Pith/arXiv arXiv 2025
[44]

Roman Vashurin, Maiya Goloburda, Albina Ilina, Aleksandr Rubashevskii, Preslav Nakov, Artem Shelmanov, and Maxim Panov. Uncertainty Quantification for LLMs through Minimum Bayes Risk: Bridging Confidence and Consistency. arXiv preprint arXiv:2502.04964, 2025. URL https://arxiv.org/abs/2502.04964

[45]

Edwin T Jaynes. Probability theory: The logic of science. Cambridge University Press, 2003

[46]

Sam Bowyer, Laurence Aitchison, and Desi R Ivanova. Position: Don’t Use the CLT in LLM Evals With Fewer Than a Few Hundred Datapoints. arXiv preprint arXiv:2503.01747, 2025. URL https://arxiv.org/abs/2503.01747

[47]

Xingyu Chen, Jiahao Xu, Tian Liang, Zhiwei He, Jianhui Pang, Dian Yu, Linfeng Song, Qiuzhi Liu, Mengfei Zhou, Zhuosheng Zhang, et al. Do NOT think that much for 2+3=? On the overthinking of o1-like LLMs. arXiv preprint arXiv:2412.21187, 2024. URL https://arxiv.org/abs/2412.21187

[48]

Shudong Liu, Hongwei Liu, Junnan Liu, Linchen Xiao, Songyang Gao, Chengqi Lyu, Yuzhe Gu, Wenwei Zhang, Derek F. Wong, Songyang Zhang, and Kai Chen. CompassVerifier: A unified and robust verifier for LLMs evaluation and outcome reward, 2025. URL https://arxiv.org/abs/2508.03686

[49]

Daya Guo, Dejian Yang, Haowei Zhang, Junxiao Song, Ruoyu Zhang, Runxin Xu, Qihao Zhu, Shirong Ma, Peiyi Wang, Xiao Bi, et al. DeepSeek-R1: Incentivizing reasoning capability in LLMs via reinforcement learning. arXiv preprint arXiv:2501.12948, 2025. URL https://arxiv.org/abs/2501.12948

[50]

Zhihong Shao, Peiyi Wang, Qihao Zhu, Runxin Xu, Junxiao Song, Xiao Bi, Haowei Zhang, Mingchuan Zhang, YK Li, Yang Wu, et al. DeepSeekMath: Pushing the limits of mathematical reasoning in open language models. arXiv preprint arXiv:2402.03300, 2024. URL https://arxiv.org/abs/2402.03300

[51]

Yuxuan Tong, Xiwen Zhang, Rui Wang, Ruidong Wu, and Junxian He. DART-Math: Difficulty-aware rejection tuning for mathematical problem-solving. In A. Globerson, L. Mackey, D. Belgrave, A. Fan, U. Paquet, J. Tomczak, and C. Zhang, editors, Advances in Neural Information Processing Systems, volume 37, pages 7821–7846. Curran Associates, Inc., 2024. URL https...

[52]

Bingbin Liu, Sebastien Bubeck, Ronen Eldan, Janardhan Kulkarni, Yuanzhi Li, Anh Nguyen, Rachel Ward, and Yi Zhang. TinyGSM: Achieving >80% on GSM8K with small language models, 2023. URL https://arxiv.org/abs/2312.09241

[53]

Hyeongdon Hwang et al. Self-explore: Enhancing mathematical reasoning in large language models by finding the first pit. In Findings of EMNLP, 2024. URL https://aclanthology.org/2024.findings-emnlp.78/

[54]

Yan Yang et al. Weak-to-strong reasoning. In Findings of EMNLP, 2024. URL https://aclanthology.org/2024.findings-emnlp.490/

[55]

Niklas Muennighoff, Zitong Yang, Weijia Shi, Xiang Lisa Li, Li Fei-Fei, Hannaneh Hajishirzi, Luke Zettlemoyer, Percy Liang, Emmanuel Candès, and Tatsunori Hashimoto. s1: Simple test-time scaling, 2025. URL https://arxiv.org/abs/2501.19393

[56]

Feng Chen, Allan Raventos, Nan Cheng, Surya Ganguli, and Shaul Druckmann. Rethinking fine-tuning when scaling test-time compute: Limiting confidence improves mathematical reasoning, 2025. URL https://arxiv.org/abs/2502.07154

[57]

Junnan Liu, Hongwei Liu, Linchen Xiao, Ziyi Wang, Kuikun Liu, Songyang Gao, Wenwei Zhang, Songyang Zhang, and Kai Chen. Are your LLMs capable of stable reasoning?, 2025. URL https://arxiv.org/abs/2412.13147

[58]

EXAONE Deep: Reasoning enhanced language models, 2025

LG AI Research, Kyunghoon Bae, Eunbi Choi, Kibong Choi, Stanley Jungkyu Choi, Yemuk Choi, Seokhee Hong, Junwon Hwang, Hyojin Jeon, Kijeong Jeon, Gerrard Jeongwon Jo, Hyunjik Jo, Jiyeon Jung, Hyosang Kim, Joonkee Kim, Seonghwan Kim, Soyeon Kim, Sunkyoung Kim, Yireun Kim, Yongil Kim, Youchul Kim, Edward Hwayoung Lee, Haeju Lee, Honglak Lee, Jinsik Lee, Kyun...

[59]

Itay Nakash, George Kour, Koren Lazar, Matan Vetzler, Guy Uziel, and Ateret Anaby-Tavor. Effective red-teaming of policy-adherent agents, 2025. URL https://arxiv.org/abs/2506.09600

[60]

Hojjat Aghakhani, Wei Dai, Andre Manoel, Xavier Fernandes, Anant Kharkar, Christopher Kruegel, Giovanni Vigna, David Evans, Ben Zorn, and Robert Sim. TrojanPuzzle: Covertly poisoning code-suggestion models,

[61]

URL https://arxiv.org/abs/2301.02344

[62]

Hongyi Liu, Shaochen Zhong, Xintong Sun, Minghao Tian, Mohsen Hariri, Zirui Liu, Ruixiang Tang, Zhimeng Jiang, Jiayi Yuan, Yu-Neng Chuang, et al. LoRATK: LoRA Once, Backdoor Everywhere in the Share-and-Play Ecosystem. arXiv preprint arXiv:2403.00108, 2024. URL https://arxiv.org/abs/2403.00108

[63]

Shenao Yan, Shen Wang, Yue Duan, Hanbin Hong, Kiho Lee, Doowon Kim, and Yuan Hong. An LLM-assisted easy-to-trigger backdoor attack on code completion models: Injecting disguised vulnerabilities against strong detection, 2024. URL https://arxiv.org/abs/2406.06822

[64]

Lakshmi Likhitha Mankali, Jitendra Bhandari, Manaar Alam, Ramesh Karri, Michail Maniatakos, Ozgur Sinanoglu, and Johann Knechtel. RTL-Breaker: Assessing the security of LLMs against backdoor attacks on HDL code generation, 2024. URL https://arxiv.org/abs/2411.17569

[65]

Rylan Schaeffer, Joshua Kazdan, John Hughes, Jordan Juravsky, Sara Price, Aengus Lynch, Erik Jones, Robert Kirk, Azalia Mirhoseini, and Sanmi Koyejo. How do large language monkeys get their power (laws)?, 2025. URL https://arxiv.org/abs/2502.17578

[66]

Shunyu Yao, Noah Shinn, Pedram Razavi, and Karthik Narasimhan. τ-bench: A benchmark for tool-agent-user interaction in real-world domains. arXiv preprint, 2024. doi: 10.48550/arXiv.2406.12045. URL https://doi.org/10.48550/arXiv.2406.12045. Introduces the pass^k metric

[67]

Junnan Liu, Hongwei Liu, Linchen Xiao, Ziyi Wang, Kuikun Liu, Songyang Gao, Wenwei Zhang, Songyang Zhang, and Kai Chen. Are your LLMs capable of stable reasoning? In Findings of ACL, 2025. URL https://aclanthology.org/2025.findings-acl.905/. Camera-ready version detailing G-Pass@k_τ and mG-Pass

[68]

Lessons from the Trenches on Reproducible Evaluation of Language Models

Stella Biderman, Hailey Schoelkopf, Lintang Sutawika, Leo Gao, Jonathan Tow, Baber Abbasi, Alham Fikri Aji, Pawan Sasanka Ammanamanchi, Sidney Black, Jordan Clive, Anthony DiPofi, Julen Etxaniz, Benjamin Fattori, Jessica Zosa Forde, Charles Foster, Jeffrey Hsu, Mimansa Jaiswal, Wilson Y. Lee, Haonan Li, Charles Lovering, Niklas Muennighoff, Ellie Pavlick...

[70]

URL https://arxiv.org/abs/2505.21972

[71]

Anastasios N Angelopoulos, Stephen Bates, Clara Fannjiang, Michael I Jordan, and Tijana Zrnic. Prediction-powered inference. Science, 382(6671):669–674, 2023

[72]

Harrie Oosterhuis, Rolf Jagerman, Zhen Qin, Xuanhui Wang, and Michael Bendersky. Reliable confidence intervals for information retrieval evaluation using generative AI. In Proceedings of the 30th ACM SIGKDD Conference on Knowledge Discovery and Data Mining, pages 2307–2317, 2024

[73]

Thomas Wolf, Lysandre Debut, Victor Sanh, Julien Chaumond, Clement Delangue, Anthony Moi, Pierric Cistac, Tim Rault, Rémi Louf, Morgan Funtowicz, et al. HuggingFace’s Transformers: State-of-the-art natural language processing. arXiv preprint arXiv:1910.03771, 2019. URL https://arxiv.org/abs/1910.03771

[74]

Woosuk Kwon, Zhuohan Li, Siyuan Zhuang, Ying Sheng, Lianmin Zheng, Cody Hao Yu, Joseph Gonzalez, Hao Zhang, and Ion Stoica. Efficient memory management for large language model serving with PagedAttention. In Proceedings of the 29th Symposium on Operating Systems Principles, pages 611–626, 2023

[75]

Bradley Brown, Jordan Juravsky, Ryan Ehrlich, Ronald Clark, Quoc V Le, Christopher Ré, and Azalia Mirhoseini. Large language monkeys: Scaling inference compute with repeated sampling. arXiv preprint arXiv:2407.21787, 2024. URL https://arxiv.org/abs/2407.21787

[76]

Kishore Papineni, Salim Roukos, Todd Ward, and Wei-Jing Zhu. Bleu: a method for automatic evaluation of machine translation. In Pierre Isabelle, Eugene Charniak, and Dekang Lin, editors, Proceedings of the 40th Annual Meeting of the Association for Computational Linguistics, pages 311–318, Philadelphia, Pennsylvania, USA, July 2002. Association for Computa... doi: 10.3115/1073083.1073135

[77]

Shuo Ren, Daya Guo, Shuai Lu, Long Zhou, Shujie Liu, Duyu Tang, Neel Sundaresan, Ming Zhou, Ambrosio Blanco, and Shuai Ma. CodeBLEU: a method for automatic evaluation of code synthesis. arXiv preprint arXiv:2009.10297, 2020. URL https://arxiv.org/abs/2009.10297

[78]

Sumith Kulal, Panupong Pasupat, Kartik Chandra, Mina Lee, Oded Padon, Alex Aiken, and Percy S Liang. SPoC: Search-based pseudocode to code. Advances in Neural Information Processing Systems, 32, 2019. URL https://arxiv.org/abs/1906.04908

[80]

URL https://arxiv.org/abs/2103.03874

[81]

Karl Cobbe, Vineet Kosaraju, Mohammad Bavarian, Mark Chen, Heewoo Jun, Lukasz Kaiser, Matthias Plappert, Jerry Tworek, Jacob Hilton, Reiichiro Nakano, et al. Training verifiers to solve math word problems. arXiv preprint arXiv:2110.14168, 2021. URL https://arxiv.org/abs/2110.14168

[82]

Xuezhi Wang, Jason Wei, Dale Schuurmans, Quoc Le, Ed Chi, Sharan Narang, Aakanksha Chowdhery, and Denny Zhou. Self-consistency improves chain of thought reasoning in language models. arXiv preprint arXiv:2203.11171, 2022. URL https://arxiv.org/abs/2203.11171

[83]

Aitor Lewkowycz, Anders Andreassen, David Dohan, Ethan Dyer, Henryk Michalewski, Vinay Ramasesh, Ambrose Slone, Cem Anil, Imanol Schlag, Theo Gutman-Solo, et al. Solving quantitative reasoning problems with language models. Advances in Neural Information Processing Systems, 35:3843–3857, 2022

[84]

M. G. Kendall. A new measure of rank correlation. Biometrika, 30(1-2):81–93, 06 1938. ISSN 0006-3444. doi: 10.1093/biomet/30.1-2.81. URL https://doi.org/10.1093/biomet/30.1-2.81

[85]

NovaSky Team. Think less, achieve more: Cut reasoning costs by 50 URL https://novasky-ai.github.io/posts/reduce-overthinking. Accessed: 2025-01-23

[1]

Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N. Gomez, Lukasz Kaiser, and Illia Polosukhin. Attention is all you need. In Advances in Neural Information Processing Systems, 2017. URL https://arxiv.org/abs/1706.03762

[2]

Tom B. Brown, Benjamin Mann, Nick Ryder, Melanie Subbiah, Jared Kaplan, Prafulla Dhariwal, et al. Language models are few-shot learners. In Advances in Neural Information Processing Systems, 2020. URL https://proceedings.neurips.cc/paper/2020/file/1457c0d6bfcb4967418bfb8ac142f64a-Paper.pdf

[3]

StackOverflow. Stack Overflow Developer Survey 2025: AI and Developer Tools, 2025. URL https://survey.stackoverflow.co/2025/ai. Accessed: 2025-09-24

[4]

Nestor Maslej, Loredana Fattorini, Raymond Perrault, Yolanda Gil, Vanessa Parli, Njenga Kariuki, Emily Capstick, Anka Reuel, Erik Brynjolfsson, John Etchemendy, et al. Artificial intelligence index report 2025. arXiv preprint arXiv:2504.07139, 2025. URL https://arxiv.org/abs/2504.07139

[5]

Percy Liang, Rishi Bommasani, et al. Holistic evaluation of language models. arXiv:2211.09110, 2022. URL https://arxiv.org/abs/2211.09110

[6]

Dan Hendrycks, Collin Burns, Steven Basart, et al. Measuring massive multitask language understanding. In International Conference on Learning Representations (ICLR), 2021. URL https://arxiv.org/abs/2009.03300

[7]

Aarohi Srivastava, Abhinav Rastogi, et al. Beyond the Imitation Game: Quantifying and Extrapolating the Capabilities of Language Models (BIG-bench). arXiv:2206.04615, 2022. URL https://arxiv.org/abs/2206.04615

[8]

Jared Kaplan, Sam McCandlish, Tom Henighan, et al. Scaling laws for neural language models. arXiv:2001.08361, 2020. URL https://arxiv.org/abs/2001.08361

[9]

Jordan Hoffmann, Sebastian Borgeaud, Arthur Mensch, et al. Training compute-optimal large language models. arXiv:2203.15556, 2022. URL https://arxiv.org/abs/2203.15556

[10]

Jason Wei, Xuezhi Wang, Dale Schuurmans, et al. Chain-of-thought prompting elicits reasoning in large language models. In Advances in Neural Information Processing Systems, 2022. URL https://openreview.net/pdf?id=_VjQlMeSB_J

[11]

Long Ouyang, Jeff Wu, Xu Jiang, et al. Training language models to follow instructions with human feedback. In Advances in Neural Information Processing Systems,

[12]

URL https://proceedings.neurips.cc/paper_files/paper/2022/file/b1efde53be364a73914f58805a001731-Paper-Conference.pdf

[13]

Tim Dettmers, Mike Lewis, Younes Belkada, and Luke Zettlemoyer. LLM.int8(): 8-bit Matrix Multiplication for Transformers at Scale, 2022. URL https://arxiv.org/abs/2208.07339

[14]

Elias Frantar, Saleh Ashkboos, Torsten Hoefler, and Dan Alistarh. GPTQ: Accurate post-training quantization for generative pre-trained transformers, 2022. URL https://arxiv.org/abs/2210.17323

[15]

Song Han, Jeff Pool, John Tran, and William Dally. Learning both weights and connections for efficient neural networks. In NeurIPS, 2015. URL https://papers.nips.cc/paper/5784-learning-both-weights-and-connections-for-efficient-neural-network

[16]

Geoffrey Hinton, Oriol Vinyals, and Jeff Dean. Distilling the knowledge in a neural network, 2015. URL https://arxiv.org/abs/1503.02531

[17]

Woosuk Kwon, Zhuohan Li, Siyuan Zhuang, Ying Sheng, Lianmin Zheng, Cody Hao Yu, Joseph E. Gonzalez, Hao Zhang, and Ion Stoica. Efficient memory management for large language model serving with PagedAttention. In SOSP, 2023. URL https://arxiv.org/abs/2309.06180

[18]

Tianyi Zhang, Jonah Yi, Zhaozhuo Xu, and Anshumali Shrivastava. KV cache is 1 bit per channel: Efficient large language model inference with coupled quantization. In A. Globerson, L. Mackey, D. Belgrave, A. Fan, U. Paquet, J. Tomczak, and C. Zhang, editors, Advances in Neural Information Processing Systems, volume 37, pages 3304–3331. Curran Associates...

[19]

Hailin Zhang, Xiaodong Ji, Yilin Chen, Fangcheng Fu, Xupeng Miao, Xiaonan Nie, Weipeng Chen, and Bin Cui. PQCache: Product quantization-based KVCache for long context LLM inference. Proc. ACM Manag. Data, 3(3), June 2025. doi: 10.1145/3725338. URL https://doi.org/10.1145/3725338

[20]

Edward J. Hu, Yelong Shen, Phillip Wallis, Zeyuan Allen-Zhu, Yuanzhi Li, Shean Wang, Lu Wang, and Weizhu Chen. LoRA: Low-Rank Adaptation of Large Language Models. arXiv preprint arXiv:2106.09685, 2021. URL https://arxiv.org/abs/2106.09685

[21]

Paul F. Christiano, Jan Leike, Tom B. Brown, Miljan Martic, Shane Legg, and Dario Amodei. Deep reinforcement learning from human preferences. In NeurIPS, 2017. URL https://arxiv.org/abs/1706.03741

[23]

Tri Dao. FlashAttention-2: Faster attention with better parallelism and work partitioning. arXiv preprint arXiv:2307.08691, 2023. URL https://arxiv.org/abs/2307.08691

[24]

Mark Chen, Jerry Tworek, Heewoo Jun, Qiming Yuan, et al. Evaluating large language models trained on code,

[26]

American Invitational Mathematics Examination (AIME): official description. Mathematical Association of America, 2025. URL https://maa.org/maa-invitational-competitions/. 15 questions, 3 hours

[27]

Andreas Hochlehnert, Hardik Bhatnagar, Vishaal Udandarao, Samuel Albanie, Ameya Prabhu, and Matthias Bethge. A sober look at progress in language model reasoning: Pitfalls and paths to reproducibility. arXiv preprint arXiv:2504.07086, 2025. URL https://arxiv.org/abs/2504.07086

[29]

Ari Holtzman, Jan Buys, Li Du, Maxwell Forbes, and Yejin Choi. The curious case of neural text degeneration. In ICLR, 2020. URL https://openreview.net/forum?id=rygGQyrFvH. arXiv:1904.09751 (2019)

[30]

Rotem Dror, Gili Baumer, Segev Shlomov, and Roi Reichart. The Hitchhiker’s Guide to Testing Statistical Significance in NLP. In ACL, pages 1383–1392, 2018. URL https://aclanthology.org/P18-1128/

[31]

Alexander Yeh. More accurate tests for the statistical significance of result differences. In COLING, 2000. URL https://aclanthology.org/C00-2137/

[32]

Jesse Dodge, Suchin Gururangan, Dallas Card, Roy Schwartz, and Noah A. Smith. Show your work: Improved reporting of experimental results. In EMNLP-IJCNLP, 2019. URL https://aclanthology.org/D19-1224/

[33]

Lianmin Zheng, Wei-Lin Chiang, Yingbo Sheng, et al. Judging LLM-as-a-Judge with MT-Bench and Chatbot Arena. arXiv preprint arXiv:2306.05685, 2023. URL https://arxiv.org/abs/2306.05685

[34]

Guande Chen, Kai Shen, Saurav Shah, et al. Humans or LLMs as the Judge? A Study on Judgement Bias. In EMNLP, 2024. URL https://aclanthology.org/2024.emnlp-main.474.pdf

[35]

Xiao Xiao, Yu Su, Sijing Zhang, Zhang Chen, Yadong Chen, and Tian Liu. Confidence in large language model evaluation: A Bayesian approach to limited-sample challenges, 2025. URL https://arxiv.org/abs/2504.21303

[36]

Dustin Hayden and Thomas Armitage. Straightforward Bayesian A/B testing with Dirichlet posteriors. arXiv preprint arXiv:2508.08077, 2025. URL https://arxiv.org/abs/2508.08077

[37]

Mathematical Association of America. American Invitational Mathematics Examination (AIME). https://maa.org/maa-invitational-competitions/, 2024. Official MAA page for the AIME competition (covers AIME 2024)

[35] [38]

American invitational mathematics examination (aime).https:// maa.org/maa-invitational-competitions/, 2025

Mathematical Association of America. American invitational mathematics examination (aime).https:// maa.org/maa-invitational-competitions/, 2025. Official MAA page for the AIME competition (covers AIME 2025)

work page 2025

[36] [39]

Hmmt february 2025 archive (problems and solutions).https:// www.hmmt.org/www/archive/282, 2025

Harvard–MIT Mathematics Tournament. Hmmt february 2025 archive (problems and solutions).https:// www.hmmt.org/www/archive/282, 2025. Official HMMT archive page for February 2025 competition

work page 2025

[37] [40]

Brown university math olympiad (brumo).https://www

Brown University Math Olympiad Organizers. Brown university math olympiad (brumo).https://www. brumo.org/tournament-info, 2025. Official BrUMO website with tournament information (Apr 4–5, 2025)

work page 2025

[38] [41]

Leveraging LLM Inconsistency to Boost Pass@ k Performance.arXiv preprint arXiv:2505.12938, 2025

Uri Dalal, Meirav Segal, Zvika Ben-Haim, Dan Lahav, and Omer Nevo. Leveraging LLM Inconsistency to Boost Pass@ k Performance.arXiv preprint arXiv:2505.12938, 2025. URLhttps://arxiv.org/abs/2505. 12938

work page arXiv 2025

[39] [42]

Evaluating Large Language Models Trained on Code

Mark Chen, Jerry Tworek, Heewoo Jun, Qiming Yuan, Henrique Ponde De Oliveira Pinto, Jared Kaplan, Harri Edwards, Yuri Burda, Nicholas Joseph, Greg Brockman, et al. Evaluating large language models trained on code. arXiv preprint arXiv:2107.03374, 2021. URLhttps://arxiv.org/abs/2107.03374

work page internal anchor Pith review Pith/arXiv arXiv 2021

[40] [43]

Textual Bayes: Quantifying Prompt Uncertainty in LLM-Based Systems

Brendan Leigh Ross, Noel V ouitsis, Atiyeh Ashari Ghomi, Rasa Hosseinzadeh, Ji Xin, Zhaoyan Liu, Yi Sui, Shiyi Hou, Kin Kwan Leung, Gabriel Loaiza-Ganem, et al. Textual Bayes: Quantifying Uncertainty in LLM- Based Systems.arXiv preprint arXiv:2506.10060, 2025. URLhttps://arxiv.org/abs/2506.10060

work page internal anchor Pith review Pith/arXiv arXiv 2025

[41] [44]

Uncertainty Quantification for LLMs through Minimum Bayes Risk: Bridging Confidence and Consistency.arXiv preprint arXiv:2502.04964, 2025

Roman Vashurin, Maiya Goloburda, Albina Ilina, Aleksandr Rubashevskii, Preslav Nakov, Artem Shelmanov, and Maxim Panov. Uncertainty Quantification for LLMs through Minimum Bayes Risk: Bridging Confidence and Consistency.arXiv preprint arXiv:2502.04964, 2025. URLhttps://arxiv.org/abs/2502.04964

work page arXiv 2025

[42] [45]

Cambridge university press, 2003

Edwin T Jaynes.Probability theory: The logic of science. Cambridge university press, 2003

work page 2003

[43] [46]

Position: Don’t Use the CLT in LLM Evals With Fewer Than a Few Hundred Datapoints.arXiv preprint arXiv:2503.01747, 2025

Sam Bowyer, Laurence Aitchison, and Desi R Ivanova. Position: Don’t Use the CLT in LLM Evals With Fewer Than a Few Hundred Datapoints.arXiv preprint arXiv:2503.01747, 2025. URLhttps://arxiv.org/ abs/2503.01747

work page arXiv 2025

[44] [47]

Do NOT Think That Much for 2+3=? On the Overthinking of o1-Like LLMs

Xingyu Chen, Jiahao Xu, Tian Liang, Zhiwei He, Jianhui Pang, Dian Yu, Linfeng Song, Qiuzhi Liu, Mengfei Zhou, Zhuosheng Zhang, et al. Do not think that much for 2+ 3=? on the overthinking of o1-like llms.arXiv preprint arXiv:2412.21187, 2024. URLhttps://arxiv.org/abs/2412.21187

work page internal anchor Pith review Pith/arXiv arXiv 2024

[45] [48]

Wong, Songyang Zhang, and Kai Chen

Shudong Liu, Hongwei Liu, Junnan Liu, Linchen Xiao, Songyang Gao, Chengqi Lyu, Yuzhe Gu, Wenwei Zhang, Derek F. Wong, Songyang Zhang, and Kai Chen. Compassverifier: A unified and robust verifier for llms evalua- tion and outcome reward, 2025. URLhttps://arxiv.org/abs/2508.03686

work page arXiv 2025

[46] [49]

DeepSeek-R1: Incentivizing Reasoning Capability in LLMs via Reinforcement Learning

Daya Guo, Dejian Yang, Haowei Zhang, Junxiao Song, Ruoyu Zhang, Runxin Xu, Qihao Zhu, Shirong Ma, Peiyi Wang, Xiao Bi, et al. Deepseek-r1: Incentivizing reasoning capability in llms via reinforcement learning.arXiv preprint arXiv:2501.12948, 2025. URLhttps://arxiv.org/abs/2501.12948

work page internal anchor Pith review Pith/arXiv arXiv 2025

[47] [50]

DeepSeekMath: Pushing the Limits of Mathematical Reasoning in Open Language Models

Zhihong Shao, Peiyi Wang, Qihao Zhu, Runxin Xu, Junxiao Song, Xiao Bi, Haowei Zhang, Mingchuan Zhang, YK Li, Yang Wu, et al. Deepseekmath: Pushing the limits of mathematical reasoning in open language models. arXiv preprint arXiv:2402.03300, 2024. URLhttps://arxiv.org/abs/2402.03300

work page internal anchor Pith review Pith/arXiv arXiv 2024

[48] [51]

Dart-math: Difficulty-aware rejection tuning for mathematical problem-solving

Yuxuan Tong, Xiwen Zhang, Rui Wang, Ruidong Wu, and Junxian He. Dart-math: Difficulty-aware rejection tuning for mathematical problem-solving. In A. Globerson, L. Mackey, D. Belgrave, A. Fan, U. Paquet, J. Tom- czak, and C. Zhang, editors,Advances in Neural Information Processing Systems, volume 37, pages 7821–7846. Curran Associates, Inc., 2024. URLhttps...

work page 2024

[49] [52]

Tinygsm: achieving ¿80% on gsm8k with small language models

Bingbin Liu, Sebastien Bubeck, Ronen Eldan, Janardhan Kulkarni, Yuanzhi Li, Anh Nguyen, Rachel Ward, and Yi Zhang. TinyGSM: Achieving 80% on GSM8K with small language models, 2023. URLhttps: //arxiv.org/abs/2312.09241. 17

work page arXiv 2023

[50] [53]

Self-explore: Enhancing mathematical reasoning in large language models by finding the first pit

Hyeongdon Hwang et al. Self-explore: Enhancing mathematical reasoning in large language models by finding the first pit. InFindings of EMNLP, 2024. URLhttps://aclanthology.org/2024. findings-emnlp.78/

work page 2024

[51] [54]

Weak-to-strong reasoning

Yan Yang et al. Weak-to-strong reasoning. InFindings of EMNLP, 2024. URLhttps://aclanthology. org/2024.findings-emnlp.490/

work page 2024

[52] [55]

s1: Simple test-time scaling

Niklas Muennighoff, Zitong Yang, Weijia Shi, Xiang Lisa Li, Li Fei-Fei, Hannaneh Hajishirzi, Luke Zettlemoyer, Percy Liang, Emmanuel Candès, and Tatsunori Hashimoto. s1: Simple test-time scaling, 2025. URLhttps: //arxiv.org/abs/2501.19393

work page internal anchor Pith review Pith/arXiv arXiv 2025

[53] [56]

Rethinking fine-tuning when scaling test-time compute: Limiting confidence improves mathematical reasoning, 2025

Feng Chen, Allan Raventos, Nan Cheng, Surya Ganguli, and Shaul Druckmann. Rethinking fine-tuning when scaling test-time compute: Limiting confidence improves mathematical reasoning, 2025. URLhttps:// arxiv.org/abs/2502.07154

work page arXiv 2025

[54] [57]

Are your llms capable of stable reasoning?arXiv preprint arXiv:2412.13147, 2024

Junnan Liu, Hongwei Liu, Linchen Xiao, Ziyi Wang, Kuikun Liu, Songyang Gao, Wenwei Zhang, Songyang Zhang, and Kai Chen. Are your llms capable of stable reasoning?, 2025. URLhttps://arxiv.org/abs/ 2412.13147

work page arXiv 2025

[55] [58]

Exaone deep: Reasoning enhanced language models, 2025

LG AI Research, Kyunghoon Bae, Eunbi Choi, Kibong Choi, Stanley Jungkyu Choi, Yemuk Choi, Seokhee Hong, Junwon Hwang, Hyojin Jeon, Kijeong Jeon, Gerrard Jeongwon Jo, Hyunjik Jo, Jiyeon Jung, Hyosang Kim, Joonkee Kim, Seonghwan Kim, Soyeon Kim, Sunkyoung Kim, Yireun Kim, Yongil Kim, Youchul Kim, Edward Hwayoung Lee, Haeju Lee, Honglak Lee, Jinsik Lee, Kyun...

work page arXiv 2025

[56] [59]

Effective red- teaming of policy-adherent agents, 2025

Itay Nakash, George Kour, Koren Lazar, Matan Vetzler, Guy Uziel, and Ateret Anaby-Tavor. Effective red- teaming of policy-adherent agents, 2025. URLhttps://arxiv.org/abs/2506.09600


[57] [60]

Trojanpuzzle: Covertly poisoning code-suggestion models,

Hojjat Aghakhani, Wei Dai, Andre Manoel, Xavier Fernandes, Anant Kharkar, Christopher Kruegel, Giovanni Vigna, David Evans, Ben Zorn, and Robert Sim. Trojanpuzzle: Covertly poisoning code-suggestion models,


[58] [61]

URL https://arxiv.org/abs/2301.02344


[59] [62]

LoRATK: LoRA Once, Backdoor Everywhere in the Share-and-Play Ecosystem. arXiv preprint arXiv:2403.00108, 2024

Hongyi Liu, Shaochen Zhong, Xintong Sun, Minghao Tian, Mohsen Hariri, Zirui Liu, Ruixiang Tang, Zhimeng Jiang, Jiayi Yuan, Yu-Neng Chuang, et al. LoRATK: LoRA Once, Backdoor Everywhere in the Share-and-Play Ecosystem. arXiv preprint arXiv:2403.00108, 2024. URL https://arxiv.org/abs/2403.00108


[60] [63]

An llm-assisted easy-to-trigger backdoor attack on code completion models: Injecting disguised vulnerabilities against strong detection, 2024

Shenao Yan, Shen Wang, Yue Duan, Hanbin Hong, Kiho Lee, Doowon Kim, and Yuan Hong. An llm-assisted easy-to-trigger backdoor attack on code completion models: Injecting disguised vulnerabilities against strong detection, 2024. URL https://arxiv.org/abs/2406.06822


[61] [64]

Rtl-breaker: Assessing the security of llms against backdoor attacks on hdl code generation, 2024

Lakshmi Likhitha Mankali, Jitendra Bhandari, Manaar Alam, Ramesh Karri, Michail Maniatakos, Ozgur Sinanoglu, and Johann Knechtel. Rtl-breaker: Assessing the security of llms against backdoor attacks on hdl code generation, 2024. URL https://arxiv.org/abs/2411.17569


[62] [65]

How do large language monkeys get their power (laws)?, 2025

Rylan Schaeffer, Joshua Kazdan, John Hughes, Jordan Juravsky, Sara Price, Aengus Lynch, Erik Jones, Robert Kirk, Azalia Mirhoseini, and Sanmi Koyejo. How do large language monkeys get their power (laws)?, 2025. URL https://arxiv.org/abs/2502.17578


[63] [66]

$\tau$-bench: A Benchmark for Tool-Agent-User Interaction in Real-World Domains

Shunyu Yao, Noah Shinn, Pedram Razavi, and Karthik Narasimhan. τ-bench: A benchmark for tool-agent-user interaction in real-world domains. arXiv preprint, 2024. doi: 10.48550/arXiv.2406.12045. URL https://doi.org/10.48550/arXiv.2406.12045. Introduces the pass^k metric


[64] [67]

Are your llms capable of stable reasoning? In Findings of ACL, 2025

Junnan Liu, Hongwei Liu, Linchen Xiao, Ziyi Wang, Kuikun Liu, Songyang Gao, Wenwei Zhang, Songyang Zhang, and Kai Chen. Are your llms capable of stable reasoning? In Findings of ACL, 2025. URL https://aclanthology.org/2025.findings-acl.905/. Camera-ready version detailing G-Pass@k_τ and mG-Pass


[65] [68]

Lessons from the Trenches on Reproducible Evaluation of Language Models

Stella Biderman, Hailey Schoelkopf, Lintang Sutawika, Leo Gao, Jonathan Tow, Baber Abbasi, Alham Fikri Aji, Pawan Sasanka Ammanamanchi, Sidney Black, Jordan Clive, Anthony DiPofi, Julen Etxaniz, Benjamin Fattori, Jessica Zosa Forde, Charles Foster, Jeffrey Hsu, Mimansa Jaiswal, Wilson Y. Lee, Haonan Li, Charles Lovering, Niklas Muennighoff, Ellie Pavlick...


[66] [70]

URL https://arxiv.org/abs/2505.21972


[67] [71]

Prediction-powered inference. Science, 382(6671):669–674, 2023

Anastasios N Angelopoulos, Stephen Bates, Clara Fannjiang, Michael I Jordan, and Tijana Zrnic. Prediction-powered inference. Science, 382(6671):669–674, 2023


[68] [72]

Reliable confidence intervals for information retrieval evaluation using generative ai

Harrie Oosterhuis, Rolf Jagerman, Zhen Qin, Xuanhui Wang, and Michael Bendersky. Reliable confidence intervals for information retrieval evaluation using generative ai. In Proceedings of the 30th ACM SIGKDD Conference on Knowledge Discovery and Data Mining, pages 2307–2317, 2024


[69] [73]

HuggingFace's Transformers: State-of-the-art Natural Language Processing

Thomas Wolf, Lysandre Debut, Victor Sanh, Julien Chaumond, Clement Delangue, Anthony Moi, Pierric Cistac, Tim Rault, Rémi Louf, Morgan Funtowicz, et al. Huggingface’s transformers: State-of-the-art natural language processing. arXiv preprint arXiv:1910.03771, 2019. URL https://arxiv.org/abs/1910.03771


[70] [74]

Efficient memory management for large language model serving with pagedattention

Woosuk Kwon, Zhuohan Li, Siyuan Zhuang, Ying Sheng, Lianmin Zheng, Cody Hao Yu, Joseph Gonzalez, Hao Zhang, and Ion Stoica. Efficient memory management for large language model serving with pagedattention. In Proceedings of the 29th symposium on operating systems principles, pages 611–626, 2023


[71] [75]

Large Language Monkeys: Scaling Inference Compute with Repeated Sampling

Bradley Brown, Jordan Juravsky, Ryan Ehrlich, Ronald Clark, Quoc V Le, Christopher Ré, and Azalia Mirhoseini. Large language monkeys: Scaling inference compute with repeated sampling. arXiv preprint arXiv:2407.21787, 2024. URL https://arxiv.org/abs/2407.21787


[72] [76]

Bleu: a method for automatic evaluation of machine translation

Kishore Papineni, Salim Roukos, Todd Ward, and Wei-Jing Zhu. Bleu: a method for automatic evaluation of machine translation. In Pierre Isabelle, Eugene Charniak, and Dekang Lin, editors, Proceedings of the 40th Annual Meeting of the Association for Computational Linguistics, pages 311–318, Philadelphia, Pennsylvania, USA, July 2002. Association for Computa...

doi: 10.3115/1073083.1073135

[73] [77]

CodeBLEU: a Method for Automatic Evaluation of Code Synthesis

Shuo Ren, Daya Guo, Shuai Lu, Long Zhou, Shujie Liu, Duyu Tang, Neel Sundaresan, Ming Zhou, Ambrosio Blanco, and Shuai Ma. Codebleu: a method for automatic evaluation of code synthesis. arXiv preprint arXiv:2009.10297, 2020. URL https://arxiv.org/abs/2009.10297


[74] [78]

SPoC: Search-based Pseudocode to Code

Sumith Kulal, Panupong Pasupat, Kartik Chandra, Mina Lee, Oded Padon, Alex Aiken, and Percy S Liang. Spoc: Search-based pseudocode to code. Advances in Neural Information Processing Systems, 32, 2019. URL https://arxiv.org/abs/1906.04908


[75] [80]

URL https://arxiv.org/abs/2103.03874


[76] [81]

Training Verifiers to Solve Math Word Problems

Karl Cobbe, Vineet Kosaraju, Mohammad Bavarian, Mark Chen, Heewoo Jun, Lukasz Kaiser, Matthias Plappert, Jerry Tworek, Jacob Hilton, Reiichiro Nakano, et al. Training verifiers to solve math word problems. arXiv preprint arXiv:2110.14168, 2021. URL https://arxiv.org/abs/2110.14168


[77] [82]

Self-Consistency Improves Chain of Thought Reasoning in Language Models

Xuezhi Wang, Jason Wei, Dale Schuurmans, Quoc Le, Ed Chi, Sharan Narang, Aakanksha Chowdhery, and Denny Zhou. Self-consistency improves chain of thought reasoning in language models. arXiv preprint arXiv:2203.11171, 2022. URL https://arxiv.org/abs/2203.11171


[78] [83]

Solving quantitative reasoning problems with language models. Advances in neural information processing systems, 35:3843–3857, 2022

Aitor Lewkowycz, Anders Andreassen, David Dohan, Ethan Dyer, Henryk Michalewski, Vinay Ramasesh, Ambrose Slone, Cem Anil, Imanol Schlag, Theo Gutman-Solo, et al. Solving quantitative reasoning problems with language models. Advances in neural information processing systems, 35:3843–3857, 2022


[79] [84]

M. G. Kendall. A new measure of rank correlation. Biometrika, 30(1-2):81–93, 06 1938. ISSN 0006-3444. doi: 10.1093/biomet/30.1-2.81. URL https://doi.org/10.1093/biomet/30.1-2.81


[80] [85]

Think less, achieve more: Cut reasoning costs by 50

NovaSky Team. Think less, achieve more: Cut reasoning costs by 50 URL https://novasky-ai.github.io/posts/reduce-overthinking. Accessed: 2025-01-23
