
arxiv: 2605.10075 · v2 · pith:5BX7IZ4Pnew · submitted 2026-05-11 · 💻 cs.AI

Active Testing of Large Language Models via Approximate Neyman Allocation

Pith reviewed 2026-05-20 23:01 UTC · model grok-4.3

classification 💻 cs.AI
keywords active testing · Neyman allocation · semantic entropy · large language models · generative evaluation · surrogate models · budget efficiency · variance reduction

The pith

Approximate Neyman allocation guided by semantic entropy from surrogate models reduces error and budget in generative LLM evaluation.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper develops an active testing procedure for large language models on generative tasks. It first computes semantic entropy on surrogate models to partition the evaluation pool into strata of varying informativeness, then performs approximate Neyman allocation that assigns more samples to strata with higher estimated variance. This produces a low-variance estimator of overall model performance from a smaller labeled subset. The method is evaluated on language and multimodal benchmarks across multiple surrogate-target pairs and shows consistent gains over uniform sampling. If the approach holds, repeated full-pool evaluations become less necessary as model scale and annotation expense continue to grow.
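The pipeline described above (stratify by surrogate entropy, allocate the budget, form a stratified estimate) can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the function names, the equal-size quantile strata, and the use of the within-stratum standard deviation of surrogate scores as the variance proxy are all choices made here for the example.

```python
import random
from statistics import mean, pstdev

def quantile_strata(entropy, k):
    """Split pool indices into k roughly equal strata by ascending surrogate entropy.
    Assumes k <= len(entropy) so that no stratum is empty."""
    order = sorted(range(len(entropy)), key=lambda i: entropy[i])
    return [order[j * len(order) // k:(j + 1) * len(order) // k] for j in range(k)]

def surrogate_sigmas(strata, surrogate_score):
    """ASSUMPTION: proxy each stratum's spread by the std of surrogate scores in it."""
    return [pstdev([surrogate_score(i) for i in idx]) for idx in strata]

def neyman_allocation(sizes, sigmas, budget):
    """Allocate budget in proportion to N_h * sigma_h.
    Keeps at least 1 and at most N_h samples per stratum; assumes
    len(sizes) <= budget <= sum(sizes) and every stratum is non-empty."""
    weights = [n * s for n, s in zip(sizes, sigmas)]
    if sum(weights) == 0:          # no spread signal: fall back to proportional
        weights = list(sizes)
    total = sum(weights)
    alloc = [min(n, max(1, int(budget * w / total))) for n, w in zip(sizes, weights)]
    # Fix rounding one sample at a time until the budget is met exactly.
    while sum(alloc) < budget:
        cands = [h for h in range(len(sizes)) if alloc[h] < sizes[h]]
        if not cands:
            break
        alloc[max(cands, key=lambda h: weights[h] / alloc[h])] += 1
    while sum(alloc) > budget:
        cands = [h for h in range(len(sizes)) if alloc[h] > 1]
        if not cands:
            break
        alloc[min(cands, key=lambda h: weights[h] / alloc[h])] -= 1
    return alloc

def stratified_estimate(strata, alloc, target_score, rng):
    """Label alloc[h] random examples per stratum and combine the stratum means,
    weighting each stratum by its share of the pool."""
    n_pool = sum(len(idx) for idx in strata)
    estimate = 0.0
    for idx, n_h in zip(strata, alloc):
        sample = rng.sample(idx, n_h)
        estimate += len(idx) / n_pool * mean(target_score(i) for i in sample)
    return estimate
```

In use, `target_score(i)` would be the only expensive call (running and labeling the target model on example `i`), and it is invoked exactly `sum(alloc)` times.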

Core claim

Our method leverages semantic entropy from surrogate models to stratify the evaluation pool and then conducts approximate Neyman allocation based on signals extracted from these surrogates, delivering up to 28% MSE reduction over Uniform Sampling and an average of 22.9% budget savings while closely tracking Oracle-Neyman across multiple language and multimodal benchmarks.

What carries the argument

Approximate Neyman allocation on semantic entropy strata, which distributes the testing budget across strata in proportion to each stratum's size times its estimated standard deviation, so as to minimize the variance of the resulting performance estimator at a fixed budget.
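For reference, the classical rule behind this step can be written out as below. The notation is introduced here for illustration and is not taken from the paper: $H$ strata, stratum sizes $N_h$ with $N = \sum_h N_h$, weights $W_h$, within-stratum standard deviations $\sigma_h$ of the target score, labeled samples $n_h$ per stratum, and total budget $M$. The finite-population correction is ignored.

```latex
\begin{align}
\hat{R} &= \sum_{h=1}^{H} W_h\, \bar{y}_h, \qquad W_h = \frac{N_h}{N}, \\
\operatorname{Var}(\hat{R}) &\approx \sum_{h=1}^{H} \frac{W_h^{2}\,\sigma_h^{2}}{n_h}, \\
n_h^{\star} &= M\,\frac{N_h\,\sigma_h}{\sum_{k=1}^{H} N_k\,\sigma_k}
  \quad \text{(minimizer subject to } \textstyle\sum_h n_h = M\text{)}, \\
\operatorname{Var}_{\min}(\hat{R}) &\approx \frac{1}{M}\Bigl(\sum_{h=1}^{H} W_h\,\sigma_h\Bigr)^{2}.
\end{align}
```

Because the true $\sigma_h$ are unknown before labeling, an approximate version would presumably substitute a surrogate-based estimate $\tilde{\sigma}_h$ into the third line; how the paper constructs that estimate is not specified in this summary.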

If this is right

  • The performance estimator variance drops when budget is shifted toward higher-entropy strata.
  • The budget needed to reach a given estimation error falls by 22.9% on average, which lowers labeling and compute costs without increasing error.
  • The same stratification and allocation steps apply to both pure language and multimodal generative benchmarks.
  • Results remain close to the ideal oracle that allocates using true target variances rather than surrogate proxies.
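To get a feel for the size of the savings that allocation alone can deliver, the textbook variance constants can be compared on a toy pool. The sketch below is an idealization, not the paper's budget-savings metric (which is not defined in this summary): it assumes 0/1 scores, known per-stratum accuracies, and no finite-population correction, and the accuracies in the usage note are made-up numbers.

```python
import math

def per_sample_variances(weights, accuracies):
    """Return (uniform, proportional, neyman) variance constants for 0/1 scores.

    Dividing a constant by the budget n gives the estimator variance:
      uniform      : mu * (1 - mu)              (simple random sampling)
      proportional : sum_h W_h * sigma_h^2      (stratified, n_h proportional to N_h)
      neyman       : (sum_h W_h * sigma_h)^2    (stratified, n_h prop. to N_h * sigma_h)
    """
    mu = sum(w * p for w, p in zip(weights, accuracies))
    sigmas = [math.sqrt(p * (1 - p)) for p in accuracies]
    uniform = mu * (1 - mu)
    proportional = sum(w * s ** 2 for w, s in zip(weights, sigmas))
    neyman = sum(w * s for w, s in zip(weights, sigmas)) ** 2
    return uniform, proportional, neyman

def budget_saving(weights, accuracies):
    """Fraction of the uniform budget that Neyman allocation can drop while
    matching the uniform-sampling variance. Undefined if the pool has zero variance."""
    uniform, _, neyman = per_sample_variances(weights, accuracies)
    return 1 - neyman / uniform
```

For example, `budget_saving([1/3, 1/3, 1/3], [0.95, 0.70, 0.40])` evaluates the saving for three equally sized strata with hypothetical accuracies of 95%, 70%, and 40%. The saving shrinks to zero when all strata have the same accuracy, which is why the quality of the stratification signal matters.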

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • If semantic entropy remains a stable proxy across model families, the same pipeline could be reused when new target models appear without recomputing full evaluations.
  • Savings would accumulate over successive development checkpoints, turning evaluation from a recurring heavy cost into a lighter recurring one.
  • The gap between surrogate and target entropy could itself become a diagnostic for when a new model diverges enough to require fresh stratification.

Load-bearing premise

Semantic entropy computed on surrogate models provides a reliable proxy for the variance or informativeness of examples with respect to the target model's performance on generative tasks.
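As a reminder of what the stratification signal measures: semantic entropy, as introduced in the Semantic Uncertainty paper listed in the references, groups several sampled answers to the same prompt by meaning and takes the entropy over the groups. The sketch below is a simplified, frequency-based version. The `equivalent` argument is a hypothetical stand-in for a real semantic-equivalence check (the original work uses bidirectional entailment), and the string-matching function in the usage note is only a toy substitute.

```python
import math

def cluster_answers(answers, equivalent):
    """Greedily group answers: each answer joins the first cluster whose
    representative (first member) it is judged equivalent to."""
    clusters = []
    for answer in answers:
        for cluster in clusters:
            if equivalent(answer, cluster[0]):
                cluster.append(answer)
                break
        else:
            clusters.append([answer])
    return clusters

def semantic_entropy(answers, equivalent):
    """Entropy (in nats) of the empirical distribution over meaning clusters.
    0 when all sampled answers agree; log(m) when m answers all disagree."""
    clusters = cluster_answers(answers, equivalent)
    n = len(answers)
    return -sum(len(c) / n * math.log(len(c) / n) for c in clusters)
```

A toy equivalence such as `lambda a, b: a.strip().lower() == b.strip().lower()` is enough to exercise the function; a prompt whose sampled answers split into many clusters gets a high score and would land in a high-entropy stratum.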

What would settle it

An experiment on a new benchmark or model pair in which the entropy-stratified Neyman allocation produces higher mean squared error than uniform sampling for the same budget.

Figures

Figures reproduced from arXiv: 2605.10075 by Cong Liu, Jiancheng Zhang, Yinglun Zhu, Zeli Liu.

Figure 1: Performance comparison across four benchmarks with labeling budget …
Figure 2: Comparison with adapted LURE variants. We report the relative MSE (lower is better) of SE-LURE, PE-LURE, and our method against Uniform Sampling across four benchmarks and model pairs at budget M = 100.

[Adjacent body text captured with the caption:] … trial t, and let T = 3,000 be the number of trials. We compute the mean squared error (MSE) $\mathrm{MSE}(\hat{R}) = \frac{1}{T}\sum_{t=1}^{T}\bigl(\hat{R}^{(t)} - R_D\bigr)^{2}$. To enable comparisons across different benchmarks, we also follow Berrada …
Figure 3: Results of active testing across four benchmarks with various surrogate-target model pairs. We …
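The evaluation protocol quoted with the Figure 2 caption (repeat the sampling procedure over T trials and average the squared error against the full-pool value) can be sketched as follows. The helper names are hypothetical, and defining relative MSE as a plain ratio to Uniform Sampling is an assumption made here, since the paper's exact normalization is cut off in the excerpt.

```python
import random
from statistics import mean

def mse(estimates, true_value):
    """Mean squared error of per-trial estimates against the full-pool value R_D."""
    return mean((r - true_value) ** 2 for r in estimates)

def relative_mse(method_estimates, uniform_estimates, true_value):
    """ASSUMPTION: relative MSE = MSE(method) / MSE(uniform); below 1 means
    the method beats Uniform Sampling at the same budget."""
    return mse(method_estimates, true_value) / mse(uniform_estimates, true_value)

def uniform_trials(scores, budget, trials, seed=0):
    """Uniform Sampling baseline: in each trial, label `budget` examples drawn
    without replacement and report their mean score."""
    rng = random.Random(seed)
    return [mean(rng.sample(scores, budget)) for _ in range(trials)]
```

With per-example target scores in `scores`, `mse(uniform_trials(scores, 100, 3000), mean(scores))` gives the baseline figure at budget M = 100; the estimates of any other method over the same number of trials can then be passed to `relative_mse`.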
Original abstract

Large language models (LLMs) require reliable evaluation from pre-training to test-time scaling, making evaluation a recurring rather than one-off cost. As model scales grow and target tasks increasingly demand expert annotators, both the compute and labeling costs needed for each evaluation rise rapidly. Active testing aims to alleviate this bottleneck by approximating the evaluation result from a small but informative subset of the evaluation pool. However, existing approaches primarily target classification and break down on generative tasks. We introduce a novel active testing algorithm tailored to generative tasks. Our method leverages semantic entropy from surrogate models to stratify the evaluation pool and then conducts approximate Neyman allocation based on signals extracted from these surrogates. Across multiple language and multimodal benchmarks and a range of surrogate-target model pairs, our method significantly improves on baselines and closely tracks Oracle-Neyman, delivering up to 28% MSE reduction over Uniform Sampling and an average of 22.9% budget savings.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The manuscript introduces an active testing method for evaluating LLMs on generative tasks. It stratifies the evaluation pool using semantic entropy signals computed on one or more surrogate models and then applies approximate Neyman allocation driven by those surrogate-derived signals to select a small informative subset. The approach is tested across language and multimodal benchmarks with multiple surrogate-target model pairs, reporting up to 28% MSE reduction relative to uniform sampling, average 22.9% budget savings, and close tracking of an Oracle-Neyman baseline.

Significance. If the reported gains prove robust, the work would offer a practical extension of classical Neyman allocation to generative evaluation settings where existing active-testing techniques fail. The use of externally computed surrogate signals to enable stratification and allocation without requiring target-model labels during selection is a clear strength, and the proximity to the oracle baseline in the tested regimes suggests the surrogate signals can be informative. This could meaningfully reduce labeling and compute costs in repeated evaluation scenarios as model scales grow.

major comments (2)
  1. [Section 3 (Stratification and Approximate Neyman Allocation)] The central claim that surrogate semantic entropy enables effective stratification rests on the assumption that this quantity correlates with per-example variance or informativeness under the target model's generative metric. No direct correlation analysis, scatter plots, or quantitative validation linking surrogate semantic entropy to target-model error variance is presented; without this, the observed MSE reductions could arise from incidental properties of the chosen surrogate-target pairs rather than the intended mechanism.
  2. [Section 5 (Experiments)] The abstract and results claim statistically meaningful improvements and budget savings, yet the manuscript provides no information on the number of independent trials, statistical significance tests, confidence intervals, data exclusion criteria, or controls for surrogate-target similarity. These omissions are load-bearing for assessing whether the 28% MSE reduction and 22.9% savings are reliable or sensitive to experimental choices.
minor comments (2)
  1. [Section 3.2] The notation for the approximate Neyman weights and the precise definition of the surrogate-derived variance proxy could be stated more explicitly in the main text rather than deferred to the appendix.
  2. [Figures 2-4] Figure captions should clarify which curves correspond to which surrogate-target combinations to aid interpretation of the cross-pair results.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for their constructive and detailed feedback. The comments highlight important aspects of mechanistic validation and experimental rigor that we will address to strengthen the manuscript. We respond to each major comment below.

  1. Referee: [Section 3 (Stratification and Approximate Neyman Allocation)] The central claim that surrogate semantic entropy enables effective stratification rests on the assumption that this quantity correlates with per-example variance or informativeness under the target model's generative metric. No direct correlation analysis, scatter plots, or quantitative validation linking surrogate semantic entropy to target-model error variance is presented; without this, the observed MSE reductions could arise from incidental properties of the chosen surrogate-target pairs rather than the intended mechanism.

    Authors: We agree that a direct quantitative link between surrogate semantic entropy and target-model per-example variance would provide stronger mechanistic support for the stratification step. While the current results show consistent gains and close tracking of the Oracle-Neyman baseline across diverse surrogate-target pairs, these outcomes are indirect evidence. In the revised manuscript we will add a new subsection in Section 3 (or an appendix) containing scatter plots of surrogate semantic entropy versus target-model error variance for each evaluated pair, together with Pearson and Spearman correlation coefficients. This addition will allow readers to assess whether the observed MSE reductions are attributable to the intended variance-informed allocation rather than incidental surrogate properties. revision: yes

  2. Referee: [Section 5 (Experiments)] The abstract and results claim statistically meaningful improvements and budget savings, yet the manuscript provides no information on the number of independent trials, statistical significance tests, confidence intervals, data exclusion criteria, or controls for surrogate-target similarity. These omissions are load-bearing for assessing whether the 28% MSE reduction and 22.9% savings are reliable or sensitive to experimental choices.

    Authors: We acknowledge that the absence of these experimental details limits the ability to judge statistical reliability. Our original experiments used 10 independent trials per configuration with different random seeds for subset selection. In the revised Section 5 we will explicitly report: (i) the number of trials, (ii) results of paired t-tests against uniform sampling with p-values, (iii) 95% confidence intervals around the reported MSE reductions and budget savings, (iv) a statement that no data points were excluded beyond standard benchmark preprocessing, and (v) an analysis of surrogate-target similarity (including model-size and architecture differences) with results broken down by similarity strata. These additions will substantiate the robustness of the 28% MSE reduction and 22.9% average savings figures. revision: yes
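The checks promised in the two responses above (Pearson and Spearman correlations between surrogate entropy and target error variance, and a paired t-test with a 95% confidence interval against uniform sampling) are standard and can be sketched with the standard library alone. This is an illustrative sketch, not the authors' analysis code; the function names are hypothetical, and the critical value must be supplied by the caller (for example 2.262 for 10 paired trials, i.e. 9 degrees of freedom, at the two-sided 95% level).

```python
import math
from statistics import mean, stdev

def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    mx, my = mean(x), mean(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    vx = sum((a - mx) ** 2 for a in x)
    vy = sum((b - my) ** 2 for b in y)
    return cov / math.sqrt(vx * vy)

def average_ranks(values):
    """1-based ranks, with tied values sharing the average of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        for k in range(i, j + 1):
            ranks[order[k]] = (i + j) / 2 + 1
        i = j + 1
    return ranks

def spearman(x, y):
    """Spearman rank correlation: Pearson correlation of the average ranks."""
    return pearson(average_ranks(x), average_ranks(y))

def paired_t(method_errors, uniform_errors, t_crit):
    """Paired t statistic and confidence interval for the mean difference
    (method minus uniform). Requires at least two pairs and non-constant
    differences; t_crit is the two-sided critical value for n - 1 d.o.f."""
    diffs = [a - b for a, b in zip(method_errors, uniform_errors)]
    center = mean(diffs)
    se = stdev(diffs) / math.sqrt(len(diffs))
    return center / se, (center - t_crit * se, center + t_crit * se)
```

A negative t statistic with a confidence interval that excludes zero would indicate that the method's per-trial error is reliably lower than that of uniform sampling on the same trials.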

Circularity Check

0 steps flagged

No significant circularity; standard Neyman allocation applied to external surrogate signals


The paper describes an application of classical Neyman allocation (a 1934 statistical result) to strata formed by semantic entropy computed on surrogate models. No equations or procedures in the abstract or described method reduce the claimed MSE reductions or budget savings to a fitted parameter or self-referential definition; the performance claims are presented as empirical outcomes across surrogate-target pairs rather than tautological consequences of the construction itself. The derivation chain relies on externally computed signals and established allocation formulas without self-citation load-bearing or ansatz smuggling for the core result.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axioms · 0 invented entities

The approach rests on the domain assumption that surrogate semantic entropy correlates with target-model evaluation variance on generative outputs; no free parameters or invented entities are described in the abstract.

axioms (1)
  • domain assumption: Semantic entropy from surrogate models serves as a valid stratification signal for the variance of the target model's generative performance
    Invoked to justify dividing the evaluation pool before applying Neyman allocation.

pith-pipeline@v0.9.0 · 5689 in / 1221 out tokens · 34771 ms · 2026-05-20T23:01:09.742312+00:00 · methodology



Reference graph

Works this paper leans on

21 extracted references · 21 canonical work pages · 12 internal anchors

  1. [1]

    Qwen3-VL Technical Report

    Shuai Bai, Yuxuan Cai, Ruizhe Chen, Keqin Chen, Xionghui Chen, Zesen Cheng, Lianghao Deng, Wei Ding, Chang Gao, Chunjiang Ge, et al. Qwen3-vl technical report. arXiv preprint arXiv:2511.21631.

  2. [2]

    Scaling up active testing to large language models

    Gabrielle Berrada, Jannik Kossen, Freddie Bickford Smith, Muhammed Razzak, Yarin Gal, and Tom Rainforth. Scaling up active testing to large language models. arXiv preprint arXiv:2508.09093.

  3. [3]

    An experimental design framework for label-efficient supervised finetuning of large language models

    Gantavya Bhatt, Yifang Chen, Arnav Das, Jifan Zhang, Sang Truong, Stephen Mussmann, Yinglun Zhu, Jeff Bilmes, Simon Du, Kevin Jamieson, et al. An experimental design framework for label-efficient supervised finetuning of large language models. In Findings of the Association for Computational Linguistics: ACL 2024, pages 6549–6560.

  4. [4]

    On statistical bias in active learning: How and when to fix it

    Sebastian Farquhar, Yarin Gal, and Tom Rainforth. On statistical bias in active learning: How and when to fix it. arXiv preprint arXiv:2101.11665.

  5. [5]

    Aryo Pradipta Gema, Joshua Ong Jun Leang, Giwon Hong, Alessio Devoto, Alberto Carlo Maria Mancino, Rohit Saxena, Xuanli He, Yu Zhao, Xiaotang Du, Mohammad Reza Ghasemi Madani, et al. Are we done with mmlu? In Proceedings of the 2025 Conference of the Nations of the Americas Chapter of the Association for Computational Linguistics: Human Language Technologi...

  6. [6]

    Training Compute-Optimal Large Language Models

    Jordan Hoffmann, Sebastian Borgeaud, Arthur Mensch, Elena Buchatskaya, Trevor Cai, Eliza Rutherford, DDL Casas, Lisa Anne Hendricks, Johannes Welbl, Aidan Clark, et al. Training compute-optimal large language models. arXiv preprint arXiv:2203.15556.

  7. [7]

    Semantic Uncertainty: Linguistic Invariances for Uncertainty Estimation in Natural Language Generation

    Lorenz Kuhn, Yarin Gal, and Sebastian Farquhar. Semantic uncertainty: Linguistic invariances for uncertainty estimation in natural language generation. arXiv preprint arXiv:2302.09664.

  8. [8]

    Mixtraining: A better trade-off between compute and performance

    Zexin Li, Jiancheng Zhang, Yufei Li, Yinglun Zhu, and Cong Liu. Mixtraining: A better trade-off between compute and performance. arXiv preprint arXiv:2502.19513.

  9. [9]

    Generative Active Testing: Efficient LLM Evaluation via Proxy Task Adaptation

    URL https://qwen.ai/blog?id=qwen3.5. Aashish Anantha Ramakrishnan, Ardavan Saeedi, Hamid Reza Hassanzadeh, Fazlolah Mohaghegh, and Dongwon Lee. Generative active testing: Efficient llm evaluation via proxy task adaptation. arXiv preprint arXiv:2603.19264.

  10. [10]

    GPQA: A Graduate-Level Google-Proof Q&A Benchmark

    David Rein, Betty Li Hou, Asa Cooper Stickland, Jackson Petty, Richard Yuanzhe Pang, Julien Dirani, Julian Michael, and Samuel R Bowman. Gpqa: A graduate-level google-proof q&a benchmark. arXiv preprint arXiv:2311.12022.

  11. [11]

    Active Learning for Convolutional Neural Networks: A Core-Set Approach

    Ozan Sener and Silvio Savarese. Active learning for convolutional neural networks: A core-set approach. arXiv preprint arXiv:1708.00489.

  12. [12]

    Scaling LLM Test-Time Compute Optimally can be More Effective than Scaling Model Parameters

    Charlie Snell, Jaehoon Lee, Kelvin Xu, and Aviral Kumar. Scaling llm test-time compute optimally can be more effective than scaling model parameters. arXiv preprint arXiv:2408.03314.

  13. [13]

    Moq: mixture-of-format activation quantization for communication-efficient ai inference system

    Haonan Wang, Zeli Liu, Chao Fang, John Paul Walters, and Stephen P Crago. Moq: mixture-of-format activation quantization for communication-efficient ai inference system. In NeurIPS 2024 Workshop Machine Learning with new Compute Paradigms, 2024a. Haonan Wang, Zeli Liu, Kajimusugura Hoshino, Tuo Zhang, John Paul Walters, and Stephen Crago. Fedpai: Achieving...

  14. [14]

    Self-Consistency Improves Chain of Thought Reasoning in Language Models

    Xuezhi Wang, Jason Wei, Dale Schuurmans, Quoc Le, Ed Chi, Sharan Narang, Aakanksha Chowdhery, and Denny Zhou. Self-consistency improves chain of thought reasoning in language models. arXiv preprint arXiv:2203.11171.

  15. [15]

    ReAct: Synergizing Reasoning and Acting in Language Models

    Shunyu Yao, Jeffrey Zhao, Dian Yu, Nan Du, Izhak Shafran, Karthik Narasimhan, and Yuan Cao. React: Synergizing reasoning and acting in language models. arXiv preprint arXiv:2210.03629.

  16. [16]

    Towards Multimodal Active Learning: Efficient Learning with Limited Paired Data

    Jiancheng Zhang and Yinglun Zhu. Towards multimodal active learning: Efficient learning with limited paired data. arXiv preprint arXiv:2510.03247.

  17. [17]

    Labelbench: A comprehensive framework for benchmarking adaptive label-efficient learning

    Jifan Zhang, Yifang Chen, Gregory Canal, Stephen Mussmann, Arnav M Das, Gantavya Bhatt, Yinglun Zhu, Jeffrey Bilmes, Simon Shaolei Du, Kevin Jamieson, et al. Labelbench: A comprehensive framework for benchmarking adaptive label-efficient learning. arXiv preprint arXiv:2306.09910.

  18. [18]

    Accelerating unbiased llm evaluation via synthetic feedback

    Zhaoyi Zhou, Yuda Song, and Andrea Zanette. Accelerating unbiased llm evaluation via synthetic feedback. arXiv preprint arXiv:2502.10563.

  19. [19]

    Test-Time Matching: Unlocking Compositional Reasoning in Multimodal Models

    Yinglun Zhu and Robert Nowak. Active learning with neural networks: Insights from nonparametric statistics. Advances in Neural Information Processing Systems, 35:142–155, 2022a. Yinglun Zhu and Robert Nowak. Efficient active learning with abstention. Advances in Neural Information Processing Systems, 35:35379–35391, 2022b. Yinglun Zhu, Jiancheng Zhang, and...

  20. [20]

    Strategic Scaling of Test-Time Compute: A Bandit Learning Approach

    Bowen Zuo and Yinglun Zhu. Strategic scaling of test-time compute: A bandit learning approach. arXiv preprint arXiv:2506.12721.

  21. [21]

    Adaptive Test-Time Compute Allocation with Evolving In-Context Demonstrations

    Bowen Zuo, Dongruo Zhou, and Yinglun Zhu. Adaptive test-time compute allocation with evolving in-context demonstrations. arXiv preprint arXiv:2604.21018.