MINCE: Shrinking LLM Evaluation Datasets via Few-Model Monte Carlo Calibration

Ashish Sirasao; Devleena Das; Elliott Delaye; Nithin Kumar Guggilla; Rajeev Patwari; Vikram Kumar Bukka

arxiv: 2606.22826 · v1 · pith:23VQDF27new · submitted 2026-06-22 · 💻 cs.AI

MINCE: Shrinking LLM Evaluation Datasets via Few-Model Monte Carlo Calibration

Devleena Das , Rajeev Patwari , Vikram Kumar Bukka , Nithin Kumar Guggilla , Elliott Delaye , Ashish Sirasao This is my paper

Pith reviewed 2026-06-26 08:50 UTC · model grok-4.3

classification 💻 cs.AI

keywords MINCELLM evaluationdataset reductionMonte Carlo calibrationaccuracy driftbenchmark subsettingfew-model calibrationevaluation speedup

0 comments

The pith

MINCE uses Monte Carlo simulation on few calibration models to shrink LLM evaluation datasets by 54-89% while bounding accuracy drift.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces MINCE to cut the repeated cost of running large benchmarks on many LLM variants, especially on edge hardware where each run can take hours. It collects per-item correctness logs from a small set of calibration models, runs Monte Carlo simulation to find the smallest subset size that keeps accuracy drift below a chosen bound, and then fixes one random subset of that size for all future evaluations. No learned predictor is required. The method delivers the reported size cuts on IFEVAL, MMLU and GSM8K together with low observed drift on both BF16 and held-out NPU models, plus speedups of several times and better results than tinyBenchmarks while using far fewer calibration models. If the underlying representativeness holds, evaluation of model families becomes dramatically cheaper without material loss of reliability.

Core claim

MINCE determines the minimum subset size via Monte Carlo simulation on per-item logs from a small calibration pool to guarantee bounded accuracy drift, then uses a fixed random sample of that size for evaluation, achieving reductions of 54% on IFEVAL, 89% on MMLU, and 70% on GSM8K with maximum drift of 2.62 percentage points on BF16 models and mean drifts of 0.77-3.59 on NPU models, along with speedups of 2.7-8.1x on GPU and 1.7-2.0x on NPU, while being robust to pool size and using far fewer calibration models than alternatives.

What carries the argument

Monte Carlo simulation over per-item correctness logs from a small set of calibration models to compute the minimum subset size bounding accuracy drift.

If this is right

Evaluation time for model variants drops by factors of 2-8x on both GPU and NPU hardware.
Accuracy drift stays at or below 2.62 pp on the tested model classes when the computed size is used.
The same subset works across quantization levels and fine-tunes once the size is fixed.
Drift remains lower than tinyBenchmarks even though 57x fewer calibration models are needed.
Performance is stable across different sizes of the calibration pool.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The same Monte Carlo sizing step could be applied to other instruction-following or reasoning benchmarks not examined here.
Running the calibration step once per hardware class might let teams maintain separate compact subsets for CPU, GPU, and NPU deployments.
If the representativeness assumption weakens for very different model families, the method could be extended with a small per-family recalibration step.

Load-bearing premise

Per-item correctness logs from a small calibration set of models are sufficiently representative that Monte Carlo simulation can reliably set a subset size guaranteeing low drift for arbitrary new models and hardware.

What would settle it

A held-out model whose accuracy on the MINCE-chosen subset drifts more than the bound computed from the calibration logs.

Figures

Figures reproduced from arXiv: 2606.22826 by Ashish Sirasao, Devleena Das, Elliott Delaye, Nithin Kumar Guggilla, Rajeev Patwari, Vikram Kumar Bukka.

**Figure 2.** Figure 2: Marginal gain in P95 drift reduction at each candidate subset size. Each curve represents one model; the [PITH_FULL_IMAGE:figures/full_fig_p005_2.png] view at source ↗

**Figure 3.** Figure 3: Accuracy drift distributions for uniform, stratified, and k-means sampling strategies at [PITH_FULL_IMAGE:figures/full_fig_p006_3.png] view at source ↗

**Figure 4.** Figure 4: Per-model accuracy drift comparison between tinyBenchmarks and MINCE, showing that MINCE reduces [PITH_FULL_IMAGE:figures/full_fig_p006_4.png] view at source ↗

**Figure 5.** Figure 5: Selected subset size n ∗ as a function of the marginal-gain threshold τ for each benchmark. Flat segments indicate τ ranges over which n ∗ is unchanged. Three observations support the choice of τ=1 pp. First, as shown in [PITH_FULL_IMAGE:figures/full_fig_p009_5.png] view at source ↗

read the original abstract

Evaluating LLMs across many model variants -- quantized, fine-tuned, or deployment-specific -- requires running large benchmarks repeatedly, a process that can take tens of hours per model on edge hardware such as NPUs. Existing subset selection methods reduce this cost but depend on large calibration pools or learned prediction layers. We introduce MINCE (Monte Carlo Informed N-sizing for Compact Evaluation), which uses Monte Carlo simulation over per-item logs from a small set of calibration models to find the minimum subset size that bounds accuracy drift and then fixes a randomly sampled subset at that size, with no prediction layer needed. MINCE reduces IFEVAL by 54\%, MMLU by 89\%, and GSM8K by 70\% with maximum drift $\leq$2.62\,pp on BF16 models and mean drift of 0.77--3.59\,pp on held-out NPU models, while delivering median GPU evaluation speedups of 2.7--8.1$\times$ and NPU evaluation speedups of 1.7--2.0$\times$. The method is robust to calibration pool size and achieves lower drift than tinyBenchmarks (12$\times$ lower on MMLU, 3.3$\times$ on GSM8K) while using 57$\times$ fewer calibration models.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note private letter to a colleague

MINCE sets subset sizes via Monte Carlo on a few calibration models' logs then samples randomly, but the drift bounds hinge on those logs representing future models' error patterns.

read the letter

MINCE determines a minimum subset size for benchmarks like IFEVAL, MMLU, and GSM8K by running Monte Carlo simulations on per-item correctness logs from a small set of calibration models, then draws a random subset of that size. No learned prediction layer is involved, and it uses far fewer calibration models than tinyBenchmarks.

The paper reports clear size cuts—54% on IFEVAL, 89% on MMLU, 70% on GSM8K—with max drift of 2.62 pp on BF16 models and mean drift of 0.77–3.59 pp on held-out NPU models, plus median speedups of 2.7–8.1× on GPU and 1.7–2.0× on NPU. It also claims lower drift than tinyBenchmarks at 57× lower calibration cost and robustness to pool size. Those empirical comparisons are the useful part.

The soft spot is the assumption that the calibration logs capture enough of the item-wise error correlations to make the simulated bounds reliable for arbitrary held-out models. If a new model has different biases or correlations not seen in the pool, the chosen size could let drift exceed the reported numbers. The abstract gives no theoretical coverage argument, so the held-out NPU results are the main evidence, and they need close checking.

This is for groups that evaluate many model variants on constrained hardware and want a lightweight way to shrink repeated runs. Readers working on benchmark efficiency will find the direct numbers and comparisons worth seeing.

It deserves peer review because the claims are specific and the procedure is testable. Send it to referees.

Referee Report

3 major / 2 minor

Summary. The paper introduces MINCE, which performs Monte Carlo simulation on per-item correctness logs collected from a small calibration pool of models to determine the smallest fixed subset size that bounds accuracy drift on benchmarks (IFEVAL, MMLU, GSM8K), then draws one such subset without any learned predictor. It reports concrete reductions (54% IFEVAL, 89% MMLU, 70% GSM8K) together with maximum drift ≤2.62 pp on BF16 models and mean drift 0.77–3.59 pp on held-out NPU models, plus median speedups of 2.7–8.1 imes (GPU) and 1.7–2.0 imes (NPU), while claiming robustness to pool size and superiority to tinyBenchmarks with 57 imes fewer calibration models.

Significance. If the Monte Carlo calibration procedure and its generalization claims are shown to be statistically sound, the work would provide a lightweight, prediction-layer-free route to compact evaluation sets that could materially reduce repeated benchmarking costs on edge hardware; the reported factor-of-57 reduction in calibration models relative to prior subset methods would be a notable practical advantage.

major comments (3)

[Abstract] Abstract and method description: the concrete drift bounds (≤2.62 pp max, 0.77–3.59 pp mean) and subset-size claims are stated without any description of the Monte Carlo procedure itself (number of draws, how tail probabilities are estimated, or the precise stopping rule that converts simulated drift into a size guarantee), rendering the numerical results unverifiable from the given information.
[Method / Calibration pool] § on calibration and held-out evaluation: the central assumption that per-item correctness logs from the small calibration pool are representative of arbitrary future models is asserted but not supported by any diversity argument, coverage test, or worst-case analysis; if a held-out model exhibits different item-wise error correlations (different architecture, quantization, or fine-tuning), the simulated size may fail to bound empirical drift.
[Experiments / Comparison] Results tables comparing to tinyBenchmarks: the reported 12 imes and 3.3 imes lower drift figures are presented without the exact calibration-pool sizes, number of Monte Carlo trials, or variance estimates used for the comparison, so it is impossible to assess whether the superiority claim is load-bearing or sensitive to those choices.

minor comments (2)

[Notation] Notation for percentage-point drift and speedup factors is used inconsistently between abstract and later sections; a single definition table would improve clarity.
[Experimental setup] The manuscript does not state the exact number of calibration models or the identity of the held-out NPU models, which are needed to reproduce the robustness-to-pool-size experiments.

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for the constructive feedback, which identifies important areas for improving clarity and supporting claims. We address each major comment below, proposing specific revisions to the manuscript where details are missing or assumptions require further elaboration.

read point-by-point responses

Referee: [Abstract] Abstract and method description: the concrete drift bounds (≤2.62 pp max, 0.77–3.59 pp mean) and subset-size claims are stated without any description of the Monte Carlo procedure itself (number of draws, how tail probabilities are estimated, or the precise stopping rule that converts simulated drift into a size guarantee), rendering the numerical results unverifiable from the given information.

Authors: We agree that the Monte Carlo procedure requires explicit description to ensure verifiability. In the revised manuscript we will expand the Method section to specify: 10,000 Monte Carlo draws per size trial, empirical quantile estimation for tail probabilities, and the stopping rule that selects the smallest size guaranteeing the target drift bound holds in at least 95% of simulated trials. Corresponding details will also be added to the abstract for completeness. revision: yes
Referee: [Method / Calibration pool] § on calibration and held-out evaluation: the central assumption that per-item correctness logs from the small calibration pool are representative of arbitrary future models is asserted but not supported by any diversity argument, coverage test, or worst-case analysis; if a held-out model exhibits different item-wise error correlations (different architecture, quantization, or fine-tuning), the simulated size may fail to bound empirical drift.

Authors: The assumption receives empirical support from the held-out NPU evaluations showing low drift. We will add a subsection describing the calibration pool's composition (models spanning multiple families, bit-widths, and fine-tuning regimes) together with a simple coverage check on error-pattern diversity. A formal worst-case analysis lies outside the paper's scope; we will instead note this as a limitation and emphasize that the method's practical utility is demonstrated by the reported held-out results rather than claimed as universally guaranteed. revision: partial
Referee: [Experiments / Comparison] Results tables comparing to tinyBenchmarks: the reported 12× and 3.3× lower drift figures are presented without the exact calibration-pool sizes, number of Monte Carlo trials, or variance estimates used for the comparison, so it is impossible to assess whether the superiority claim is load-bearing or sensitive to those choices.

Authors: We acknowledge that these experimental details are necessary for assessing the comparison. The revised version will report the precise calibration-pool sizes (5 models for MINCE), the number of Monte Carlo trials (10,000), and standard-error estimates or confidence intervals on the drift values. These additions will allow readers to evaluate sensitivity and the robustness of the superiority claim. revision: yes

Circularity Check

0 steps flagged

No significant circularity in derivation chain

full rationale

The paper derives subset sizes and drift bounds via Monte Carlo simulation performed exclusively on per-item correctness logs from a separate calibration pool of models. These logs are distinct from the held-out BF16 and NPU models used to report empirical drift (≤2.62 pp max, 0.77–3.59 pp mean). The final subset is fixed by random sampling at the simulated size; no parameters are fitted to the held-out results, and no self-citation, uniqueness theorem, or ansatz is invoked to justify the bounds. The comparison to tinyBenchmarks is an external empirical benchmark. The derivation chain is therefore self-contained and does not reduce to its inputs by construction.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Abstract supplies insufficient technical detail to identify concrete free parameters, axioms, or invented entities beyond the general reliance on calibration-model logs being representative.

pith-pipeline@v0.9.1-grok · 5791 in / 1137 out tokens · 30222 ms · 2026-06-26T08:50:45.390628+00:00 · methodology

discussion (0)

Reference graph

Works this paper leans on

49 extracted references · 1 canonical work pages

[2]

Balinski and H

Michel L. Balinski and H. Peyton Young. 2010. Fair Representation: Meeting the Ideal of One Man, One Vote, 2nd edition. Brookings Institution Press

2010
[3]

Stella Biderman, Hailey Schoelkopf, Quentin Gregory Anthony, Herbie Bradley, Kyle O'Brien, and 1 others. 2024. https://doi.org/10.5281/zenodo.10256836 A framework for few-shot language model evaluation

work page doi:10.5281/zenodo.10256836 2024
[5]

William G. Cochran. 1977. Sampling Techniques, 3rd edition. John Wiley & Sons

1977
[6]

Cl\' e mentine Fourrier, Nathan Habib, Thomas Wolf, and Lewis Tunstall. 2024. Open LLM leaderboard. Hugging Face

2024
[7]

Aaron Grattafiori, Abhimanyu Dubey, Abhinav Jauhri, and 1 others. 2024. The L lama 3 herd of models. arXiv preprint arXiv:2407.21783

Pith/arXiv arXiv 2024
[8]

Dan Hendrycks, Collin Burns, Steven Basart, Andy Zou, Mantas Mazeika, Dawn Song, and Jacob Steinhardt. 2021. Measuring massive multitask language understanding. In Proceedings of the International Conference on Learning Representations

2021
[9]

Jiang, Alexandre Sablayrolles, Arthur Mensch, Chris Bamford, Devendra Singh Chaplot, Diego de las Casas, and 1 others

Albert Q. Jiang, Alexandre Sablayrolles, Arthur Mensch, Chris Bamford, Devendra Singh Chaplot, Diego de las Casas, and 1 others. 2023. Mistral 7 B . arXiv preprint arXiv:2310.06825

Pith/arXiv arXiv 2023
[10]

Schulze Buschoff, and 1 others

Alex Kipnis, Konstantinos Voudouris, Luca M. Schulze Buschoff, and 1 others. 2025. metabench -- a sparse benchmark of reasoning and knowledge in large language models. In Proceedings of the 13th International Conference on Learning Representations

2025
[11]

Yotam Perlitz, Elron Bandel, Ariel Gera, Ofir Arviv, Michal Shlain, Michal Shmueli-Scheuer, Leshem Choshen, Noam Slonim, and Dafna Sheinwald. 2024. Efficient benchmarking (of language models). In Proceedings of the 2024 Conference of the North American Chapter of the Association for Computational Linguistics

2024
[12]

Felipe Maia Polo, Lucas Weber, Leshem Choshen, Yuekai Sun, Gongjun Xu, and Mikhail Yurochkin. 2024. tiny B enchmarks: evaluating LLM s with fewer examples. In Proceedings of the 41st International Conference on Machine Learning

2024
[13]

Vijay Janapa Reddi, Christine Cheng, David Kanter, Peter Mattson, Guenther Schmuelling, Carole-Jean Wu, Brian Anderson, Maximilien Breeskin, Mark Bughici, Ciro Cebo, and 1 others. 2020. MLPerf inference benchmark. Proceedings of the ACM/IEEE International Symposium on Computer Architecture (ISCA)

2020
[14]

Rubinstein and Dirk P

Reuven Y. Rubinstein and Dirk P. Kroese. 2008. Simulation and the M onte C arlo Method , 2nd edition. John Wiley & Sons

2008
[15]

Shaobo Wang, Cong Wang, Wenjie Fu, Yue Min, Mingquan Feng, Isabel Guan, Xuming Hu, Conghui He, Cunxiang Wang, Kexin Yang, Xingzhang Ren, Fei Huang, Dayiheng Liu, and Linfeng Zhang. 2026. Rethinking LLM evaluation: Can we evaluate LLM s with 200 less data? In Proceedings of the 14th International Conference on Learning Representations

2026
[17]

Peiwen Yuan, Yueqi Zhang, Shaoxiong Feng, Yiwei Li, Xinglin Wang, Jiayi Shi, Chuyi Tan, Boyuan Pan, Yao Hu, and Kan Li. 2025. Beyond one-size-fits-all: Tailored benchmarks for efficient evaluation. In Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics

2025
[18]

Taolin Zhang, Hang Guo, Wang Lu, Tao Dai, Shu-Tao Xia, and Jindong Wang. 2026. Sparse E val: Efficient evaluation of large language models by sparse optimization. In Proceedings of the 14th International Conference on Learning Representations

2026
[20]

Jeffrey Zhou, Tianjian Lu, Swaroop Mishra, Siddhartha Brahma, Sujoy Basu, Yi Luan, Denny Zhou, and Le Hou. 2024. Instruction-following evaluation for large language models. In Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics

2024
[21]

Polo, Felipe Maia and Weber, Lucas and Choshen, Leshem and Sun, Yuekai and Xu, Gongjun and Yurochkin, Mikhail , booktitle=. tiny
[22]

Proceedings of the 2024 Conference of the North American Chapter of the Association for Computational Linguistics , year=

Efficient Benchmarking (of Language Models) , author=. Proceedings of the 2024 Conference of the North American Chapter of the Association for Computational Linguistics , year=

2024
[23]

Proceedings of the 18th Conference of the European Chapter of the Association for Computational Linguistics , year=

Anchor Points: Benchmarking Models with Much Fewer Examples , author=. Proceedings of the 18th Conference of the European Chapter of the Association for Computational Linguistics , year=
[24]

Proceedings of the 13th International Conference on Learning Representations , year=

metabench -- A Sparse Benchmark of Reasoning and Knowledge in Large Language Models , author=. Proceedings of the 13th International Conference on Learning Representations , year=
[25]

Rethinking

Wang, Shaobo and Wang, Cong and Fu, Wenjie and Min, Yue and Feng, Mingquan and Guan, Isabel and Hu, Xuming and He, Conghui and Wang, Cunxiang and Yang, Kexin and Ren, Xingzhang and Huang, Fei and Liu, Dayiheng and Zhang, Linfeng , booktitle=. Rethinking
[26]

Zhang, Taolin and Guo, Hang and Lu, Wang and Dai, Tao and Xia, Shu-Tao and Wang, Jindong , booktitle=. Sparse
[27]

Active Evaluation Acquisition for Efficient

Li, Yang and others , booktitle=. Active Evaluation Acquisition for Efficient
[28]

Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics , year=

Beyond One-Size-Fits-All: Tailored Benchmarks for Efficient Evaluation , author=. Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics , year=
[29]

Zhu, Xuechen and Tang, Kaiqiang and Pan, Yifei and Du, Mengdi and Xu, Jiaze and Jiang, Zhisong and Xie, Pengtao , booktitle=. Sub
[30]

arXiv preprint arXiv:2502.10312 , year=

Scales++: Efficient Benchmark Evaluation at Scale through Item-Centric Assessment of Cognitive Abilities , author=. arXiv preprint arXiv:2502.10312 , year=

arXiv
[31]

arXiv preprint arXiv:2502.07489 , year=

Less is More: A Submodular Approach for Efficient LLM Benchmark Selection , author=. arXiv preprint arXiv:2502.07489 , year=

Pith/arXiv arXiv
[32]

Proceedings of the 42nd International Conference on Machine Learning , year=

Autoeval Done Right: Using Synthetic Data for Model Evaluation , author=. Proceedings of the 42nd International Conference on Machine Learning , year=
[33]

1977 , publisher=

Sampling Techniques , author=. 1977 , publisher=

1977
[34]

Journal of the Royal Statistical Society , volume=

On the Two Different Aspects of the Representative Method , author=. Journal of the Royal Statistical Society , volume=
[35]

and Kroese, Dirk P

Rubinstein, Reuven Y. and Kroese, Dirk P. , edition=. Simulation and the. 2008 , publisher=

2008
[36]

Proceedings of the International Conference on Learning Representations , year=

Measuring Massive Multitask Language Understanding , author=. Proceedings of the International Conference on Learning Representations , year=
[37]

Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics , year=

Instruction-Following Evaluation for Large Language Models , author=. Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics , year=
[38]

arXiv preprint arXiv:2110.14168 , year=

Training Verifiers to Solve Math Word Problems , author=. arXiv preprint arXiv:2110.14168 , year=

Pith/arXiv arXiv
[39]

Wang, Yubo and Ma, Xueguang and Zhang, Ge and Ni, Yuansheng and Chandra, Abhranil and Guo, Shiguang and Ren, Weiming and Arulraj, Aaran and He, Xuan and Jiang, Ziyan and others , journal=
[40]

, journal=

Rein, David and Hou, Betty Li and Stickland, Asa Cooper and Petty, Jackson and Pang, Richard Yuanzhe and Dirani, Julien and Michael, Julian and Bowman, Samuel R. , journal=
[41]

arXiv preprint arXiv:2501.14249 , year=

Humanity's Last Exam , author=. arXiv preprint arXiv:2501.14249 , year=

Pith/arXiv arXiv
[42]

Transactions on Machine Learning Research , year=

Holistic Evaluation of Language Models , author=. Transactions on Machine Learning Research , year=
[43]

arXiv preprint arXiv:2505.09388 , year=

Qwen3 Technical Report , author=. arXiv preprint arXiv:2505.09388 , year=

Pith/arXiv arXiv
[44]

Grattafiori, Aaron and Dubey, Abhimanyu and Jauhri, Abhinav and others , journal=. The
[45]

arXiv preprint arXiv:2412.19437 , year=

Pith/arXiv arXiv
[46]

and Sablayrolles, Alexandre and Mensch, Arthur and Bamford, Chris and Chaplot, Devendra Singh and de las Casas, Diego and others , journal=

Jiang, Albert Q. and Sablayrolles, Alexandre and Mensch, Arthur and Bamford, Chris and Chaplot, Devendra Singh and de las Casas, Diego and others , journal=. Mistral 7
[47]

arXiv preprint arXiv:2412.08905 , year=

Phi-4 Technical Report , author=. arXiv preprint arXiv:2412.08905 , year=

Pith/arXiv arXiv
[48]

arXiv preprint arXiv:2506.05176 , year=

Qwen3 Embedding: Advancing Text Embedding and Reranking Through Foundation Models , author=. arXiv preprint arXiv:2506.05176 , year=

Pith/arXiv arXiv
[49]

2024 , publisher=

Biderman, Stella and Schoelkopf, Hailey and Anthony, Quentin Gregory and Bradley, Herbie and O'Brien, Kyle and others , title=. 2024 , publisher=

2024
[50]

Fourrier, Cl\'. Open. Hugging Face , year=
[51]

2010 , publisher=

Fair Representation: Meeting the Ideal of One Man, One Vote , author=. 2010 , publisher=

2010
[52]

Proceedings of the 17th Conference of the European Chapter of the Association for Computational Linguistics , year=

Muennighoff, Niklas and Tazi, Nouamane and Magne, Lo\". Proceedings of the 17th Conference of the European Chapter of the Association for Computational Linguistics , year=
[53]

Reddi, Vijay Janapa and Cheng, Christine and Kanter, David and Mattson, Peter and Schmuelling, Guenther and Wu, Carole-Jean and Anderson, Brian and Breeskin, Maximilien and Bughici, Mark and Cebo, Ciro and others , journal=

[1] [2]

Balinski and H

Michel L. Balinski and H. Peyton Young. 2010. Fair Representation: Meeting the Ideal of One Man, One Vote, 2nd edition. Brookings Institution Press

2010

[2] [3]

Stella Biderman, Hailey Schoelkopf, Quentin Gregory Anthony, Herbie Bradley, Kyle O'Brien, and 1 others. 2024. https://doi.org/10.5281/zenodo.10256836 A framework for few-shot language model evaluation

work page doi:10.5281/zenodo.10256836 2024

[3] [5]

William G. Cochran. 1977. Sampling Techniques, 3rd edition. John Wiley & Sons

1977

[4] [6]

Cl\' e mentine Fourrier, Nathan Habib, Thomas Wolf, and Lewis Tunstall. 2024. Open LLM leaderboard. Hugging Face

2024

[5] [7]

Aaron Grattafiori, Abhimanyu Dubey, Abhinav Jauhri, and 1 others. 2024. The L lama 3 herd of models. arXiv preprint arXiv:2407.21783

Pith/arXiv arXiv 2024

[6] [8]

Dan Hendrycks, Collin Burns, Steven Basart, Andy Zou, Mantas Mazeika, Dawn Song, and Jacob Steinhardt. 2021. Measuring massive multitask language understanding. In Proceedings of the International Conference on Learning Representations

2021

[7] [9]

Jiang, Alexandre Sablayrolles, Arthur Mensch, Chris Bamford, Devendra Singh Chaplot, Diego de las Casas, and 1 others

Albert Q. Jiang, Alexandre Sablayrolles, Arthur Mensch, Chris Bamford, Devendra Singh Chaplot, Diego de las Casas, and 1 others. 2023. Mistral 7 B . arXiv preprint arXiv:2310.06825

Pith/arXiv arXiv 2023

[8] [10]

Schulze Buschoff, and 1 others

Alex Kipnis, Konstantinos Voudouris, Luca M. Schulze Buschoff, and 1 others. 2025. metabench -- a sparse benchmark of reasoning and knowledge in large language models. In Proceedings of the 13th International Conference on Learning Representations

2025

[9] [11]

Yotam Perlitz, Elron Bandel, Ariel Gera, Ofir Arviv, Michal Shlain, Michal Shmueli-Scheuer, Leshem Choshen, Noam Slonim, and Dafna Sheinwald. 2024. Efficient benchmarking (of language models). In Proceedings of the 2024 Conference of the North American Chapter of the Association for Computational Linguistics

2024

[10] [12]

Felipe Maia Polo, Lucas Weber, Leshem Choshen, Yuekai Sun, Gongjun Xu, and Mikhail Yurochkin. 2024. tiny B enchmarks: evaluating LLM s with fewer examples. In Proceedings of the 41st International Conference on Machine Learning

2024

[11] [13]

Vijay Janapa Reddi, Christine Cheng, David Kanter, Peter Mattson, Guenther Schmuelling, Carole-Jean Wu, Brian Anderson, Maximilien Breeskin, Mark Bughici, Ciro Cebo, and 1 others. 2020. MLPerf inference benchmark. Proceedings of the ACM/IEEE International Symposium on Computer Architecture (ISCA)

2020

[12] [14]

Rubinstein and Dirk P

Reuven Y. Rubinstein and Dirk P. Kroese. 2008. Simulation and the M onte C arlo Method , 2nd edition. John Wiley & Sons

2008

[13] [15]

Shaobo Wang, Cong Wang, Wenjie Fu, Yue Min, Mingquan Feng, Isabel Guan, Xuming Hu, Conghui He, Cunxiang Wang, Kexin Yang, Xingzhang Ren, Fei Huang, Dayiheng Liu, and Linfeng Zhang. 2026. Rethinking LLM evaluation: Can we evaluate LLM s with 200 less data? In Proceedings of the 14th International Conference on Learning Representations

2026

[14] [17]

Peiwen Yuan, Yueqi Zhang, Shaoxiong Feng, Yiwei Li, Xinglin Wang, Jiayi Shi, Chuyi Tan, Boyuan Pan, Yao Hu, and Kan Li. 2025. Beyond one-size-fits-all: Tailored benchmarks for efficient evaluation. In Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics

2025

[15] [18]

Taolin Zhang, Hang Guo, Wang Lu, Tao Dai, Shu-Tao Xia, and Jindong Wang. 2026. Sparse E val: Efficient evaluation of large language models by sparse optimization. In Proceedings of the 14th International Conference on Learning Representations

2026

[16] [20]

Jeffrey Zhou, Tianjian Lu, Swaroop Mishra, Siddhartha Brahma, Sujoy Basu, Yi Luan, Denny Zhou, and Le Hou. 2024. Instruction-following evaluation for large language models. In Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics

2024

[17] [21]

Polo, Felipe Maia and Weber, Lucas and Choshen, Leshem and Sun, Yuekai and Xu, Gongjun and Yurochkin, Mikhail , booktitle=. tiny

[18] [22]

Proceedings of the 2024 Conference of the North American Chapter of the Association for Computational Linguistics , year=

Efficient Benchmarking (of Language Models) , author=. Proceedings of the 2024 Conference of the North American Chapter of the Association for Computational Linguistics , year=

2024

[19] [23]

Proceedings of the 18th Conference of the European Chapter of the Association for Computational Linguistics , year=

Anchor Points: Benchmarking Models with Much Fewer Examples , author=. Proceedings of the 18th Conference of the European Chapter of the Association for Computational Linguistics , year=

[20] [24]

Proceedings of the 13th International Conference on Learning Representations , year=

metabench -- A Sparse Benchmark of Reasoning and Knowledge in Large Language Models , author=. Proceedings of the 13th International Conference on Learning Representations , year=

[21] [25]

Rethinking

Wang, Shaobo and Wang, Cong and Fu, Wenjie and Min, Yue and Feng, Mingquan and Guan, Isabel and Hu, Xuming and He, Conghui and Wang, Cunxiang and Yang, Kexin and Ren, Xingzhang and Huang, Fei and Liu, Dayiheng and Zhang, Linfeng , booktitle=. Rethinking

[22] [26]

Zhang, Taolin and Guo, Hang and Lu, Wang and Dai, Tao and Xia, Shu-Tao and Wang, Jindong , booktitle=. Sparse

[23] [27]

Active Evaluation Acquisition for Efficient

Li, Yang and others , booktitle=. Active Evaluation Acquisition for Efficient

[24] [28]

Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics , year=

Beyond One-Size-Fits-All: Tailored Benchmarks for Efficient Evaluation , author=. Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics , year=

[25] [29]

Zhu, Xuechen and Tang, Kaiqiang and Pan, Yifei and Du, Mengdi and Xu, Jiaze and Jiang, Zhisong and Xie, Pengtao , booktitle=. Sub

[26] [30]

arXiv preprint arXiv:2502.10312 , year=

Scales++: Efficient Benchmark Evaluation at Scale through Item-Centric Assessment of Cognitive Abilities , author=. arXiv preprint arXiv:2502.10312 , year=

arXiv

[27] [31]

arXiv preprint arXiv:2502.07489 , year=

Less is More: A Submodular Approach for Efficient LLM Benchmark Selection , author=. arXiv preprint arXiv:2502.07489 , year=

Pith/arXiv arXiv

[28] [32]

Proceedings of the 42nd International Conference on Machine Learning , year=

Autoeval Done Right: Using Synthetic Data for Model Evaluation , author=. Proceedings of the 42nd International Conference on Machine Learning , year=

[29] [33]

1977 , publisher=

Sampling Techniques , author=. 1977 , publisher=

1977

[30] [34]

Journal of the Royal Statistical Society , volume=

On the Two Different Aspects of the Representative Method , author=. Journal of the Royal Statistical Society , volume=

[31] [35]

and Kroese, Dirk P

Rubinstein, Reuven Y. and Kroese, Dirk P. , edition=. Simulation and the. 2008 , publisher=

2008

[32] [36]

Proceedings of the International Conference on Learning Representations , year=

Measuring Massive Multitask Language Understanding , author=. Proceedings of the International Conference on Learning Representations , year=

[33] [37]

Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics , year=

Instruction-Following Evaluation for Large Language Models , author=. Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics , year=

[34] [38]

arXiv preprint arXiv:2110.14168 , year=

Training Verifiers to Solve Math Word Problems , author=. arXiv preprint arXiv:2110.14168 , year=

Pith/arXiv arXiv

[35] [39]

Wang, Yubo and Ma, Xueguang and Zhang, Ge and Ni, Yuansheng and Chandra, Abhranil and Guo, Shiguang and Ren, Weiming and Arulraj, Aaran and He, Xuan and Jiang, Ziyan and others , journal=

[36] [40]

, journal=

Rein, David and Hou, Betty Li and Stickland, Asa Cooper and Petty, Jackson and Pang, Richard Yuanzhe and Dirani, Julien and Michael, Julian and Bowman, Samuel R. , journal=

[37] [41]

arXiv preprint arXiv:2501.14249 , year=

Humanity's Last Exam , author=. arXiv preprint arXiv:2501.14249 , year=

Pith/arXiv arXiv

[38] [42]

Transactions on Machine Learning Research , year=

Holistic Evaluation of Language Models , author=. Transactions on Machine Learning Research , year=

[39] [43]

arXiv preprint arXiv:2505.09388 , year=

Qwen3 Technical Report , author=. arXiv preprint arXiv:2505.09388 , year=

Pith/arXiv arXiv

[40] [44]

Grattafiori, Aaron and Dubey, Abhimanyu and Jauhri, Abhinav and others , journal=. The

[41] [45]

arXiv preprint arXiv:2412.19437 , year=

Pith/arXiv arXiv

[42] [46]

and Sablayrolles, Alexandre and Mensch, Arthur and Bamford, Chris and Chaplot, Devendra Singh and de las Casas, Diego and others , journal=

Jiang, Albert Q. and Sablayrolles, Alexandre and Mensch, Arthur and Bamford, Chris and Chaplot, Devendra Singh and de las Casas, Diego and others , journal=. Mistral 7

[43] [47]

arXiv preprint arXiv:2412.08905 , year=

Phi-4 Technical Report , author=. arXiv preprint arXiv:2412.08905 , year=

Pith/arXiv arXiv

[44] [48]

arXiv preprint arXiv:2506.05176 , year=

Qwen3 Embedding: Advancing Text Embedding and Reranking Through Foundation Models , author=. arXiv preprint arXiv:2506.05176 , year=

Pith/arXiv arXiv

[45] [49]

2024 , publisher=

Biderman, Stella and Schoelkopf, Hailey and Anthony, Quentin Gregory and Bradley, Herbie and O'Brien, Kyle and others , title=. 2024 , publisher=

2024

[46] [50]

Fourrier, Cl\'. Open. Hugging Face , year=

[47] [51]

2010 , publisher=

Fair Representation: Meeting the Ideal of One Man, One Vote , author=. 2010 , publisher=

2010

[48] [52]

Proceedings of the 17th Conference of the European Chapter of the Association for Computational Linguistics , year=

Muennighoff, Niklas and Tazi, Nouamane and Magne, Lo\". Proceedings of the 17th Conference of the European Chapter of the Association for Computational Linguistics , year=

[49] [53]

Reddi, Vijay Janapa and Cheng, Christine and Kanter, David and Mattson, Peter and Schmuelling, Guenther and Wu, Carole-Jean and Anderson, Brian and Breeskin, Maximilien and Bughici, Mark and Cebo, Ciro and others , journal=