Active Testing of Large Language Models via Approximate Neyman Allocation
Pith reviewed 2026-05-20 23:01 UTC · model grok-4.3
The pith
Approximate Neyman allocation guided by semantic entropy from surrogate models reduces error and budget in generative LLM evaluation.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
Our method leverages semantic entropy from surrogate models to stratify the evaluation pool and then conducts approximate Neyman allocation based on signals extracted from these surrogates, delivering up to 28% MSE reduction over Uniform Sampling and an average of 22.9% budget savings while closely tracking Oracle-Neyman across multiple language and multimodal benchmarks.
What carries the argument
Approximate Neyman allocation on semantic entropy strata, which distributes the testing budget across strata in proportion to each stratum's size times its estimated standard deviation, minimizing the variance of the resulting performance estimator for a fixed budget.
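For reference, the classical result being approximated can be written out as follows. The notation is an assumption introduced for this sketch rather than taken from the paper: $H$ strata, $N_h$ examples in stratum $h$ out of $N$ in total, $\sigma_h$ the within-stratum standard deviation of the per-example score, $m_h$ the number of examples sampled from stratum $h$, and $m$ the total budget. Finite-population corrections are ignored.

```latex
\hat{\mu} = \sum_{h=1}^{H} \frac{N_h}{N}\,\bar{y}_h,
\qquad
\operatorname{Var}(\hat{\mu}) \approx \sum_{h=1}^{H} \left(\frac{N_h}{N}\right)^{2} \frac{\sigma_h^{2}}{m_h},
\qquad
m_h^{\text{Neyman}} = m \cdot \frac{N_h\,\sigma_h}{\sum_{k=1}^{H} N_k\,\sigma_k}.
```

Minimizing the variance expression subject to $\sum_h m_h = m$ gives the allocation on the right, so large or noisy strata receive more of the budget. The allocation is "approximate" because $\sigma_h$ for the target model is unknown before labeling and has to be replaced by a surrogate-derived proxy.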
If this is right
- The performance estimator variance drops when budget is shifted toward higher-entropy strata.
- Labeling and compute costs fall by 22.9% on average (roughly one-fifth) without increasing estimation error.
- The same stratification and allocation steps apply to both pure language and multimodal generative benchmarks.
- Results remain close to the ideal oracle that allocates using true target variances rather than surrogate proxies.
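The first bullet can be illustrated numerically. The sketch below is not the paper's code: the stratum sizes and standard deviations are made-up values, and `stratified_variance` uses the textbook variance formula without finite-population correction. It compares proportional allocation with Neyman allocation at the same budget.

```python
def neyman_allocation(sizes, sigmas, budget):
    """Real-valued Neyman allocation: m_h proportional to N_h * sigma_h."""
    weights = [n * s for n, s in zip(sizes, sigmas)]
    total = sum(weights)
    return [budget * w / total for w in weights]


def proportional_allocation(sizes, budget):
    """Real-valued proportional allocation: m_h proportional to N_h."""
    total = sum(sizes)
    return [budget * n / total for n in sizes]


def stratified_variance(sizes, sigmas, allocation):
    """Variance of the stratified mean, ignoring finite-population correction."""
    total = sum(sizes)
    return sum((n / total) ** 2 * s ** 2 / m
               for n, s, m in zip(sizes, sigmas, allocation))


# Hypothetical low-, medium-, and high-entropy strata (illustrative values only).
sizes = [500, 300, 200]
sigmas = [0.10, 0.30, 0.50]
budget = 100

var_prop = stratified_variance(sizes, sigmas, proportional_allocation(sizes, budget))
var_neyman = stratified_variance(sizes, sigmas, neyman_allocation(sizes, sigmas, budget))
print(f"proportional: {var_prop:.6f}, Neyman: {var_neyman:.6f}")
```

With these invented numbers the Neyman variance comes out below the proportional one, because the small but noisy third stratum receives a larger share of the budget than its size alone would give it.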
Where Pith is reading between the lines
- If semantic entropy remains a stable proxy across model families, the same pipeline could be reused when new target models appear without recomputing full evaluations.
- Savings would accumulate over successive development checkpoints, making each round of a recurring evaluation cheaper.
- The gap between surrogate and target entropy could itself become a diagnostic for when a new model diverges enough to require fresh stratification.
Load-bearing premise
Semantic entropy computed on surrogate models provides a reliable proxy for the variance or informativeness of examples with respect to the target model's performance on generative tasks.
What would settle it
An experiment on a new benchmark or model pair in which the entropy-stratified Neyman allocation produces higher mean squared error than uniform sampling for the same budget.
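A falsification test of this kind could be prototyped as a small Monte Carlo comparison. The sketch below is a toy under loud assumptions: the correctness labels are synthetic binary values, the three strata and their success rates are invented, and the allocation uses the true stratum standard deviations (an oracle-style stand-in, where the paper would use surrogate signals).

```python
import math
import random


def make_pool(sizes, rates):
    """Synthetic binary labels per stratum with a fixed fraction of ones."""
    pool = []
    for n, p in zip(sizes, rates):
        ones = round(n * p)
        pool.append([1] * ones + [0] * (n - ones))
    return pool


def allocate(sizes, sigmas, budget):
    """Neyman-style integer allocation with at least one sample per stratum."""
    weights = [n * s for n, s in zip(sizes, sigmas)]
    total = sum(weights)
    return [max(1, min(n, round(budget * w / total))) for n, w in zip(sizes, weights)]


def compare_mse(sizes, rates, budget, trials=2000, seed=0):
    """Return (uniform MSE, stratified MSE) over repeated random draws."""
    rng = random.Random(seed)
    pool = make_pool(sizes, rates)
    flat = [y for stratum in pool for y in stratum]
    n_total = len(flat)
    truth = sum(flat) / n_total
    sigmas = [math.sqrt(p * (1 - p)) for p in rates]
    alloc = allocate(sizes, sigmas, budget)
    spent = sum(alloc)  # rounding may shift the total slightly
    se_uniform = se_strat = 0.0
    for _ in range(trials):
        # Uniform baseline: simple random sample with the same spend.
        est_u = sum(rng.sample(flat, spent)) / spent
        # Stratified estimator: size-weighted mean of stratum sample means.
        est_s = sum(len(s) / n_total * sum(rng.sample(s, m)) / m
                    for s, m in zip(pool, alloc))
        se_uniform += (est_u - truth) ** 2
        se_strat += (est_s - truth) ** 2
    return se_uniform / trials, se_strat / trials


mse_uniform, mse_strat = compare_mse([1000, 1000, 1000], [0.02, 0.5, 0.98], budget=60)
print(mse_uniform, mse_strat)
```

On real data, a stratified MSE above the uniform MSE at equal budget would be the disconfirming result described above.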
Original abstract
Large language models (LLMs) require reliable evaluation from pre-training to test-time scaling, making evaluation a recurring rather than one-off cost. As model scales grow and target tasks increasingly demand expert annotators, both the compute and labeling costs needed for each evaluation rise rapidly. Active testing aims to alleviate this bottleneck by approximating the evaluation result from a small but informative subset of the evaluation pool. However, existing approaches primarily target classification and break down on generative tasks. We introduce a novel active testing algorithm tailored to generative tasks. Our method leverages semantic entropy from surrogate models to stratify the evaluation pool and then conducts approximate Neyman allocation based on signals extracted from these surrogates. Across multiple language and multimodal benchmarks and a range of surrogate-target model pairs, our method significantly improves on baselines and closely tracks Oracle-Neyman, delivering up to 28% MSE reduction over Uniform Sampling and an average of 22.9% budget savings.
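To make the stratification step concrete, here is a minimal sketch of a discrete semantic-entropy score followed by equal-frequency binning. It assumes each prompt already has several surrogate generations that were grouped into meaning clusters by some external equivalence check (the cluster ids below are hypothetical inputs), and that strata are formed by entropy rank; the paper may form its strata differently.

```python
import math
from collections import Counter


def semantic_entropy(cluster_ids):
    """Entropy (in nats) of the empirical distribution over meaning clusters."""
    counts = Counter(cluster_ids)
    total = len(cluster_ids)
    return -sum((c / total) * math.log(c / total) for c in counts.values())


def stratify_by_entropy(entropies, num_strata):
    """Assign each example to a stratum by entropy rank (equal-frequency bins)."""
    order = sorted(range(len(entropies)), key=lambda i: entropies[i])
    strata = [0] * len(entropies)
    for rank, idx in enumerate(order):
        strata[idx] = rank * num_strata // len(entropies)
    return strata


# Hypothetical cluster ids for four prompts, five surrogate samples each.
samples = [
    [0, 0, 0, 0, 0],  # all samples agree: zero entropy
    [0, 0, 0, 0, 1],
    [0, 0, 1, 1, 2],
    [0, 1, 2, 3, 4],  # all samples disagree: maximum entropy for five samples
]
entropies = [semantic_entropy(s) for s in samples]
strata = stratify_by_entropy(entropies, num_strata=2)
print(entropies, strata)
```

Prompts where the surrogate's answers split across many meanings land in the higher-numbered strata, which is where a variance-driven allocation would then spend more of the budget.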
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The manuscript introduces an active testing method for evaluating LLMs on generative tasks. It stratifies the evaluation pool using semantic entropy signals computed on one or more surrogate models and then applies approximate Neyman allocation driven by those surrogate-derived signals to select a small informative subset. The approach is tested across language and multimodal benchmarks with multiple surrogate-target model pairs, reporting up to 28% MSE reduction relative to uniform sampling, average 22.9% budget savings, and close tracking of an Oracle-Neyman baseline.
Significance. If the reported gains prove robust, the work would offer a practical extension of classical Neyman allocation to generative evaluation settings where existing active-testing techniques fail. The use of externally computed surrogate signals to enable stratification and allocation without requiring target-model labels during selection is a clear strength, and the proximity to the oracle baseline in the tested regimes suggests the surrogate signals can be informative. This could meaningfully reduce labeling and compute costs in repeated evaluation scenarios as model scales grow.
major comments (2)
- [Section 3 (Stratification and Approximate Neyman Allocation)] The central claim that surrogate semantic entropy enables effective stratification rests on the assumption that this quantity correlates with per-example variance or informativeness under the target model's generative metric. No direct correlation analysis, scatter plots, or quantitative validation linking surrogate semantic entropy to target-model error variance is presented; without this, the observed MSE reductions could arise from incidental properties of the chosen surrogate-target pairs rather than the intended mechanism.
- [Section 5 (Experiments)] The abstract and results claim statistically meaningful improvements and budget savings, yet the manuscript provides no information on the number of independent trials, statistical significance tests, confidence intervals, data exclusion criteria, or controls for surrogate-target similarity. These omissions are load-bearing for assessing whether the 28% MSE reduction and 22.9% savings are reliable or sensitive to experimental choices.
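The validation requested in the first major comment could be run with a check along these lines. This is a sketch using only the Python standard library; the entropy and variance values are placeholders, not data from the paper.

```python
import math


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs))
    sy = math.sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)


def average_ranks(values):
    """Ranks starting at 1, with tied values sharing their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        shared = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = shared
        i = j + 1
    return ranks


def spearman(xs, ys):
    """Spearman correlation: Pearson correlation of the average ranks."""
    return pearson(average_ranks(xs), average_ranks(ys))


# Placeholder values: surrogate semantic entropy vs. target per-example variance.
surrogate_entropy = [0.0, 0.2, 0.5, 0.9, 1.4, 1.6]
target_variance = [0.01, 0.03, 0.02, 0.12, 0.20, 0.24]
print(pearson(surrogate_entropy, target_variance),
      spearman(surrogate_entropy, target_variance))
```

A weak or unstable correlation across surrogate-target pairs would support the referee's concern that the reported gains may not come from the intended mechanism.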
minor comments (2)
- [Section 3.2] The notation for the approximate Neyman weights and the precise definition of the surrogate-derived variance proxy could be stated more explicitly in the main text rather than deferred to the appendix.
- [Figures 2-4] Figure captions should clarify which curves correspond to which surrogate-target combinations to aid interpretation of the cross-pair results.
Simulated Author's Rebuttal
We thank the referee for their constructive and detailed feedback. The comments highlight important aspects of mechanistic validation and experimental rigor that we will address to strengthen the manuscript. We respond to each major comment below.
Point-by-point responses
- Referee: [Section 3 (Stratification and Approximate Neyman Allocation)] The central claim that surrogate semantic entropy enables effective stratification rests on the assumption that this quantity correlates with per-example variance or informativeness under the target model's generative metric. No direct correlation analysis, scatter plots, or quantitative validation linking surrogate semantic entropy to target-model error variance is presented; without this, the observed MSE reductions could arise from incidental properties of the chosen surrogate-target pairs rather than the intended mechanism.
  Authors: We agree that a direct quantitative link between surrogate semantic entropy and target-model per-example variance would provide stronger mechanistic support for the stratification step. While the current results show consistent gains and close tracking of the Oracle-Neyman baseline across diverse surrogate-target pairs, these outcomes are indirect evidence. In the revised manuscript we will add a new subsection in Section 3 (or an appendix) containing scatter plots of surrogate semantic entropy versus target-model error variance for each evaluated pair, together with Pearson and Spearman correlation coefficients. This addition will allow readers to assess whether the observed MSE reductions are attributable to the intended variance-informed allocation rather than incidental surrogate properties.
  Revision: yes
- Referee: [Section 5 (Experiments)] The abstract and results claim statistically meaningful improvements and budget savings, yet the manuscript provides no information on the number of independent trials, statistical significance tests, confidence intervals, data exclusion criteria, or controls for surrogate-target similarity. These omissions are load-bearing for assessing whether the 28% MSE reduction and 22.9% savings are reliable or sensitive to experimental choices.
  Authors: We acknowledge that the absence of these experimental details limits the ability to judge statistical reliability. Our original experiments used 10 independent trials per configuration with different random seeds for subset selection. In the revised Section 5 we will explicitly report: (i) the number of trials, (ii) results of paired t-tests against uniform sampling with p-values, (iii) 95% confidence intervals around the reported MSE reductions and budget savings, (iv) a statement that no data points were excluded beyond standard benchmark preprocessing, and (v) an analysis of surrogate-target similarity (including model-size and architecture differences) with results broken down by similarity strata. These additions will substantiate the robustness of the 28% MSE reduction and 22.9% average savings figures.
  Revision: yes
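The paired comparison promised in the rebuttal could be computed roughly as follows. This sketch uses only the standard library, so it reports the t statistic and a confidence interval but not a p-value. The critical value 2.262 is the two-sided 95% Student-t quantile for 9 degrees of freedom (matching 10 trials), and the per-trial error values are invented for illustration.

```python
import math
import statistics


def paired_t_summary(baseline, method, t_crit=2.262):
    """Paired t statistic and confidence interval for mean(baseline - method).

    t_crit defaults to the two-sided 95% critical value for 9 degrees of
    freedom, i.e. 10 paired trials; pass another value for other trial counts.
    """
    diffs = [b - m for b, m in zip(baseline, method)]
    n = len(diffs)
    mean_diff = statistics.mean(diffs)
    std_err = statistics.stdev(diffs) / math.sqrt(n)
    t_stat = mean_diff / std_err
    half_width = t_crit * std_err
    return t_stat, (mean_diff - half_width, mean_diff + half_width)


# Invented per-seed squared errors for 10 trials (not results from the paper).
uniform_mse = [0.0041, 0.0038, 0.0045, 0.0040, 0.0043,
               0.0039, 0.0042, 0.0044, 0.0037, 0.0041]
method_mse = [0.0031, 0.0029, 0.0034, 0.0030, 0.0030,
              0.0028, 0.0033, 0.0032, 0.0029, 0.0030]
t_stat, ci = paired_t_summary(uniform_mse, method_mse)
print(t_stat, ci)
```

Pairing by seed matters here: both methods are evaluated under the same random conditions in each trial, so the test is run on per-trial differences rather than on two independent samples.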
Circularity Check
No significant circularity; standard Neyman allocation applied to external surrogate signals
The paper describes an application of classical Neyman allocation (a 1934 statistical result) to strata formed by semantic entropy computed on surrogate models. No equations or procedures in the abstract or described method reduce the claimed MSE reductions or budget savings to a fitted parameter or self-referential definition; the performance claims are presented as empirical outcomes across surrogate-target pairs rather than tautological consequences of the construction itself. The derivation chain relies on externally computed signals and established allocation formulas, without load-bearing self-citations or a smuggled-in ansatz behind the core result.
Axiom & Free-Parameter Ledger
axioms (1)
- domain assumption: Semantic entropy from surrogate models serves as a valid stratification signal for the variance of the target model's generative performance.
Lean theorems connected to this paper
- IndisputableMonolith/Cost/FunctionalEquation.lean, washburn_uniqueness_aczel (tag: unclear)
  Relation between the paper passage and the cited Recognition theorem: unclear.
  "Our method leverages semantic entropy from surrogate models to stratify the evaluation pool and then conducts approximate Neyman allocation based on signals extracted from these surrogates."
- IndisputableMonolith/Foundation/AlphaCoordinateFixation.lean, alpha_pin_under_high_calibration (tag: unclear)
  Relation between the paper passage and the cited Recognition theorem: unclear.
  "Budget allocation. ... $m_h \propto N_h \cdot \left(\sqrt{p_h(1 - p_h)} + \delta\right)$"
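Read literally, the quoted allocation rule can be sketched as below. Several things here are assumptions, since the excerpt is elided: p_h is taken to be a surrogate-derived estimate of the success rate in stratum h, δ a small smoothing constant that keeps near-deterministic strata from receiving zero budget, and the rounding scheme (largest remainder) is a choice made for this illustration. The default `delta=0.05` is arbitrary.

```python
import math


def approx_neyman_allocation(sizes, success_rates, budget, delta=0.05):
    """Integer allocation with m_h proportional to N_h * (sqrt(p_h(1-p_h)) + delta).

    delta=0.05 is an arbitrary illustrative default, not a value from the paper.
    """
    weights = [n * (math.sqrt(p * (1 - p)) + delta)
               for n, p in zip(sizes, success_rates)]
    total = sum(weights)
    exact = [budget * w / total for w in weights]
    alloc = [math.floor(x) for x in exact]
    # Largest-remainder rounding so the allocation sums exactly to the budget.
    leftover = budget - sum(alloc)
    by_remainder = sorted(range(len(exact)),
                          key=lambda i: exact[i] - alloc[i], reverse=True)
    for i in by_remainder[:leftover]:
        alloc[i] += 1
    return alloc


# Hypothetical strata: sizes and surrogate-estimated success rates.
alloc = approx_neyman_allocation([400, 400, 200], [0.95, 0.50, 0.10], budget=50)
print(alloc)
```

This sketch does not cap m_h at N_h; a fuller version would redistribute any excess when a stratum is smaller than its computed share.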
What do these tags mean?
- matches: The paper's claim is directly supported by a theorem in the formal canon.
- supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
- extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
- uses: The paper appears to rely on the theorem as machinery.
- contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
- unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.