pith. machine review for the scientific record.

arxiv: 2604.11328 · v1 · submitted 2026-04-13 · 💻 cs.AI · cs.LG

Recognition: unknown

Select Smarter, Not More: Prompt-Aware Evaluation Scheduling with Submodular Guarantees

Authors on Pith: no claims yet

Pith reviewed 2026-05-10 15:42 UTC · model grok-4.3

classification 💻 cs.AI cs.LG
keywords prompt optimization · evaluation scheduling · submodular optimization · item response theory · automatic prompt engineering · token efficiency · adaptive selection

The pith

POES selects evaluation examples to discriminate strong prompt candidates, yielding higher accuracy with substantially lower token costs than random or fixed subsets.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

Automatic prompt optimization requires evaluating many prompt variants, but scoring each against the full training set quickly exhausts token budgets. The paper maps this to an online adaptive testing scenario where training examples act as test items chosen to separate the best prompts from the rest. It constructs a selection objective that adds an item-response-theory discrimination term, a facility-location coverage term, and switching-cost penalties for reusing prior selections; the resulting function is monotone submodular and therefore admits a greedy algorithm with a (1 − 1/e) approximation guarantee. An adaptive controller further tunes the balance between trying new items and exploiting known good ones as optimization advances. If the claim holds, practitioners can achieve higher final accuracy while spending far fewer tokens per evaluation round, effectively allowing more prompt candidates to be tested within the same compute envelope.
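The (1 − 1/e) figure is the classical guarantee for greedy maximization of a monotone submodular function (Nemhauser, Wolsey, and Fisher, 1978). A minimal sketch of that greedy rule, with a toy facility-location utility standing in for the paper's objective (the item names and similarity scores are invented for illustration, not taken from POES):

```python
def greedy_select(items, utility, k):
    """Pick k items by largest marginal gain. For a monotone submodular
    `utility`, the selected set is within (1 - 1/e) of the optimal value."""
    selected = []
    for _ in range(k):
        best = max((i for i in items if i not in selected),
                   key=lambda i: utility(selected + [i]) - utility(selected))
        selected.append(best)
    return selected

# Toy similarity between data points (a, b, c) and candidate items (x, y, z).
sim = {
    ("a", "x"): 0.9, ("a", "y"): 0.1, ("a", "z"): 0.2,
    ("b", "x"): 0.2, ("b", "y"): 0.8, ("b", "z"): 0.3,
    ("c", "x"): 0.3, ("c", "y"): 0.2, ("c", "z"): 0.7,
}

def coverage(subset):
    # Facility location: each data point is credited with its best
    # representative in the chosen subset.
    if not subset:
        return 0.0
    return sum(max(sim[(p, s)] for s in subset) for p in ("a", "b", "c"))

print(greedy_select(["x", "y", "z"], coverage, 2))  # → ['x', 'y']
```

Each round costs one utility evaluation per remaining item, which is why the paper's per-round swap budget and warm starts matter at scale.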

Core claim

POES frames automatic prompt optimization as the problem of adaptively selecting the training examples that most effectively discriminate among candidate prompts. The method combines three components into one objective proven to be monotone submodular: an IRT discrimination utility that prioritizes items good at separating strong from weak prompts, a facility-location term that ensures broad coverage of the example space, and warm-start swaps that limit switching costs. This property supplies a (1 − 1/e) guarantee for the greedy selector at cold starts and bounded performance drift under warm-start updates. An adaptive controller then modulates exploration versus exploitation according to how far the optimization has progressed.

What carries the argument

The unified submodular objective in POES, formed by summing an IRT-based discrimination utility, a facility-location coverage function, and switching-cost-aware warm-start terms, which enables greedy selection with formal guarantees while adapting to optimization progress.
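As a rough illustration of how such a sum can stay monotone submodular, here is a hypothetical stand-in: a modular 1PL-style discrimination term plus a facility-location coverage term, combined with non-negative weights. The probabilities and similarities below are invented; the paper's actual terms, including the switching-cost component, are defined in its Section 3.

```python
def irt_info(p):
    # 1PL-style item information p(1 - p): largest for items the current
    # prompt population answers correctly about half the time.
    return p * (1.0 - p)

def objective(subset, success_prob, sim, points, w_disc=1.0, w_cov=1.0):
    """Non-negative weighted sum of a modular discrimination term and a
    facility-location coverage term; both summands are monotone submodular,
    so the sum is too, and greedy selection keeps its guarantee."""
    disc = sum(irt_info(success_prob[i]) for i in subset)
    cov = sum(max((sim[(p, s)] for s in subset), default=0.0) for p in points)
    return w_disc * disc + w_cov * cov

# Invented toy data: two data points a/b, three candidate items x/y/z.
success_prob = {"x": 0.5, "y": 0.9, "z": 0.6}
sim = {("a", "x"): 0.9, ("a", "y"): 0.1, ("a", "z"): 0.3,
       ("b", "x"): 0.2, ("b", "y"): 0.8, ("b", "z"): 0.4}
points = ["a", "b"]

f = lambda S: objective(S, success_prob, sim, points)
# Diminishing returns: adding "y" helps less once "x" is already chosen.
gain_empty = f(["y"]) - f([])
gain_after_x = f(["x", "y"]) - f(["x"])
assert gain_after_x <= gain_empty
```

Note that a switching-cost term subtracted naively could break monotonicity; the paper's claim is precisely that its formulation avoids this.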

If this is right

  • At any fixed evaluation budget the scheduler returns higher downstream prompt accuracy than fixed or heuristic baselines.
  • Reducing the evaluation set from 30-50 to 20 examples via principled selection preserves or improves performance, cutting token consumption by 35-60 percent.
  • The submodular guarantee allows the scheduler to be deployed without manual tuning of subset sizes.
  • Evaluation scheduling can be treated as an explicit, optimizable stage in prompt optimization pipelines rather than an afterthought.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The discrimination-plus-coverage logic could transfer to other settings where each iteration requires costly scoring against a large data pool, such as active learning loops or iterative model selection.
  • If submodularity survives more aggressive adaptive policies, it would support online selection algorithms that react to prompt-performance signals in real time without sacrificing approximation bounds.
  • Token savings of this magnitude could be reinvested to enlarge the search space of prompt candidates or to run longer optimization trajectories on the same hardware budget.

Load-bearing premise

The objective remains monotone submodular after the discrimination, coverage, and cost terms are combined and after the adaptive controller adjusts the exploration-exploitation balance.
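The combination step leans on a standard closure property, stated here for reference rather than reproduced from the paper (which must additionally show that each term, the switching-cost one in particular, qualifies individually):

```latex
% If f_1, \dots, f_m are monotone submodular on ground set V and
% w_1, \dots, w_m \ge 0, then F = \sum_i w_i f_i is monotone submodular:
% for all S \subseteq T \subseteq V and e \in V \setminus T,
F(S \cup \{e\}) - F(S)
  = \sum_i w_i \bigl( f_i(S \cup \{e\}) - f_i(S) \bigr)
  \;\ge\; \sum_i w_i \bigl( f_i(T \cup \{e\}) - f_i(T) \bigr)
  = F(T \cup \{e\}) - F(T).
```

The delicate part is therefore not the weighted combination but the individual terms and the adaptive reweighting, which is where the premise could fail.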

What would settle it

Compare the final prompt accuracy obtained when using POES-selected subsets against accuracy obtained when using randomly selected subsets of identical size on a new task; absence of a consistent advantage would falsify the benefit of the submodular scheduling approach.
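One way to operationalize that comparison is a paired test across seeds at matched subset size. The per-seed accuracies below are invented placeholders; a real test would substitute measured values from the new task.

```python
from statistics import mean, stdev

# Hypothetical per-seed final accuracies at an identical subset size k.
poes_acc = [0.82, 0.80, 0.83, 0.81, 0.84]
rand_acc = [0.76, 0.78, 0.75, 0.79, 0.77]

diffs = [p - r for p, r in zip(poes_acc, rand_acc)]
d_mean = mean(diffs)
# Paired t statistic: mean per-seed difference over its standard error.
t = d_mean / (stdev(diffs) / len(diffs) ** 0.5)
print(f"mean gain {d_mean:.3f}, paired t = {t:.2f}")  # → mean gain 0.050, paired t = 3.95
```

A consistently positive, significant paired difference supports the scheduler; a statistic near zero on a new task would count against it.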

Figures

Figures reproduced from arXiv: 2604.11328 by Haoyue Liu, Xiaoying Tang, Xiaoyu Ma, Ye Chen, Yiwen Li, Yongxin Guo, Zhichao Wang.

Figure 1: (a) Optimization curves on BBH Navigate: static baselines plateau while POES (red) continues improving via prompt-aware subset adaptation, achieving +8.3% over the best baseline. (b) Accuracy vs. token consumption: POES at k=20 dominates all baselines (high accuracy, moderate cost); at k=10 it stays close to baseline accuracy (0.804 vs. 0.820) with 34% fewer tokens.
Figure 2: Overview of the POES framework. The scheduler (dashed box) integrates five components …
Figure 3: Per-task accuracy comparison on rank-1 tasks (OPRO …
Figure 4: Analysis and ablation. (a) POES outperforms the baseline average on 86% of tasks (30W/1T/5L). (b) Improvement correlates negatively with baseline accuracy (r=−0.40, p=0.016): gains are largest on harder tasks. (c) Waterfall decomposes the 12.8pp ablation gain: adaptive (+6.5pp), warmup (+5.5pp), coverage (+0.8pp). (d) Warm-start matches or exceeds cold-start on all 4 tasks (+9% BBH Navigate, +11% BB Naviga…
Figure 5: Hyperparameter sensitivity. (a) Test accuracy across 8 benchmarks (faded lines) and their average (dark, ±1 SEM) as a function of candidate pool size k; performance plateaus beyond k=20. (b) Optimization token cost scales linearly with k; the shaded region marks diminishing returns. (c) IRT model comparison: the 1PL model (blue) outperforms the 2PL model (red) on average (+2.3pp), with the largest gap on D…
Figure 6: Optimization curves on 8 representative tasks. POES (red) reaches the highest final score …
Figure 7: Ablation optimization curves (best seed) on three tasks. POES (red) reaches the highest or …
Figure 8: Per-task rank distribution (violin plot) for each scheduling method. Lower rank is better.
Original abstract

Automatic prompt optimization (APO) hinges on the quality of its evaluation signal, yet scoring every prompt candidate on the full training set is prohibitively expensive. Existing methods either fix a single evaluation subset before optimization begins (principled but prompt-agnostic) or adapt it heuristically during optimization (flexible but unstable and lacking formal guarantees). We observe that APO naturally maps to an online adaptive testing problem: prompts are examinees, training examples are test items, and the scheduler should select items that best discriminate among the strongest candidates. This insight motivates Prompt-Aware Online Evaluation Scheduling (POES), which integrates an IRT-based discrimination utility, a facility-location coverage term, and switching-cost-aware warm-start swaps into a unified objective that is provably monotone submodular, yielding a (1-1/e) greedy guarantee for cold starts and bounded drift for warm-start updates. An adaptive controller modulates the exploration-exploitation balance based on optimization progress. Across 36 tasks spanning three benchmark families, POES achieves the highest overall average accuracy (6.2 percent improvement over the best baseline) with negligible token overhead (approximately 4 percent) at the same evaluation budget. Moreover, principled selection at k = 20 examples matches or exceeds the performance of naive evaluation at k = 30-50, reducing token consumption by 35-60 percent, showing that selecting smarter is more effective than selecting more. Our results demonstrate that evaluation scheduling is a first-class component of APO, not an implementation detail.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 1 minor

Summary. The paper claims that by framing automatic prompt optimization (APO) as an online adaptive testing problem, one can design Prompt-Aware Online Evaluation Scheduling (POES) using a composite objective of IRT discrimination, facility location coverage, and switching costs that is monotone submodular. This yields a (1-1/e) greedy guarantee and allows an adaptive controller. Experiments across 36 tasks show POES achieves 6.2% higher average accuracy than the best baseline with ~4% token overhead, and that k=20 principled selection outperforms naive k=30-50, saving tokens.

Significance. Should the submodularity property be established, this provides a principled, guaranteed-efficient method for evaluation in APO, which is a key bottleneck. The empirical demonstration of performance gains and token reduction at fixed budget underscores the value of smart scheduling over simply using more examples. It positions evaluation scheduling as central to APO rather than an afterthought.

major comments (2)
  1. [Abstract] The assertion that the unified objective is 'provably monotone submodular' yielding the (1-1/e) guarantee is made without any proof sketch, derivation, or verification of submodularity preservation after combining terms and under adaptive modulation. This is load-bearing for the theoretical justification of POES over heuristics.
  2. [Results section] The reported 6.2% average accuracy improvement and token savings lack details on statistical controls, number of runs, variance, or precise baseline implementations, making it difficult to assess the robustness of the empirical claims.
minor comments (1)
  1. Clarify the specific benchmark families and tasks used in the 36-task evaluation for reproducibility.

Simulated Authors' Rebuttal

2 responses · 0 unresolved

We thank the referee for their constructive feedback, which helps strengthen the presentation of both the theoretical guarantees and empirical results for POES. We address each major comment point by point below and will revise the manuscript accordingly.

Point-by-point responses
  1. Referee: [Abstract] The assertion that the unified objective is 'provably monotone submodular' yielding the (1-1/e) guarantee is made without any proof sketch, derivation, or verification of submodularity preservation after combining terms and under adaptive modulation. This is load-bearing for the theoretical justification of POES over heuristics.

    Authors: The full manuscript (Section 3) establishes monotonicity and submodularity separately for the IRT discrimination utility, the facility-location coverage term, and the switching-cost penalty; it then proves that their non-negative linear combination remains monotone submodular and that the adaptive controller induces only bounded drift, preserving the (1-1/e) greedy guarantee for cold-start selection. Because the abstract is space-constrained, we omitted an explicit sketch there. In the revision we will insert a concise two-sentence proof outline immediately after the claim in the abstract and add a pointer to the full derivation in Section 3. revision: yes

  2. Referee: [Results section] The reported 6.2% average accuracy improvement and token savings lack details on statistical controls, number of runs, variance, or precise baseline implementations, making it difficult to assess the robustness of the empirical claims.

    Authors: We agree that additional statistical detail is warranted. The 6.2% figure is the mean improvement across 36 tasks, each evaluated with 5 independent random seeds; standard deviations and 95% confidence intervals will be reported in the revised results tables. We will also expand the experimental-setup subsection to specify exact baseline configurations (including prompt-selection heuristics, evaluation budgets, and hyper-parameters) and to confirm that all methods were run under identical token budgets and model checkpoints. revision: yes

Circularity Check

0 steps flagged

No significant circularity in derivation chain

Full rationale

The paper constructs the POES objective by combining standard external components (IRT discrimination utility, facility-location coverage, switching-cost warm-start swaps) into a unified function asserted to be monotone submodular, yielding the (1-1/e) greedy guarantee. This is not self-definitional, as the submodularity is claimed to follow from the properties of the combined terms rather than being defined in terms of the target APO accuracy or fitted parameters. No load-bearing self-citations, ansatz smuggling, or renaming of known results appear in the abstract or described chain. Experiments across 36 tasks supply independent empirical support. The adaptive modulation is stated to preserve the property without reducing the central claim to a tautology or input fit.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Abstract supplies no explicit free parameters, axioms, or invented entities; all modeling assumptions are implicit in the IRT and submodularity claims.

pith-pipeline@v0.9.0 · 5595 in / 1238 out tokens · 37132 ms · 2026-05-10T15:42:59.960255+00:00 · methodology

discussion (0)


Reference graph

Works this paper leans on

69 extracted references · 16 canonical work pages · 3 internal anchors

  1. [1]

    Large language models are human-level prompt engineers, 2023

    Y Zhou, AI Muresanu, Z Han, K Paster, S Pitis, H Chan, and J Ba. Large language models are human-level prompt engineers (arxiv: 2211.01910). arxiv, 2023

  2. [2]

    Large language models as optimizers

    Chengrun Yang, Xuezhi Wang, Yifeng Lu, Hanxiao Liu, Quoc V Le, Denny Zhou, and Xinyun Chen. Large language models as optimizers. In The Twelfth International Conference on Learning Representations, 2023

  3. [3]

    Quantifying language models' sensitivity to spurious features in prompt design

    Melanie Sclar, Yejin Choi, Yulia Tsvetkov, and Alane Suhr. Quantifying language models' sensitivity to spurious features in prompt design or: How I learned to start worrying about prompt formatting. arXiv preprint arXiv:2310.11324, 2023

  4. [4]

    Language models are few-shot learners

    Tom Brown, Benjamin Mann, Nick Ryder, Melanie Subbiah, Jared D Kaplan, Prafulla Dhariwal, Arvind Neelakantan, Pranav Shyam, Girish Sastry, Amanda Askell, et al. Language models are few-shot learners. Advances in Neural Information Processing Systems, 33:1877–1901, 2020

  5. [5]

    Chain-of-thought prompting elicits reasoning in large language models

    Jason Wei, Xuezhi Wang, Dale Schuurmans, Maarten Bosma, Fei Xia, Ed Chi, Quoc V Le, Denny Zhou, et al. Chain-of-thought prompting elicits reasoning in large language models. Advances in Neural Information Processing Systems, 35:24824–24837, 2022

  6. [6]

    Automatic prompt optimization with gradient descent and beam search

    Reid Pryzant, Dan Iter, Jerry Li, Yin Lee, Chenguang Zhu, and Michael Zeng. Automatic prompt optimization with gradient descent and beam search. In Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing, pages 7957–7968, 2023

  7. [7]

    Evoprompt: Connecting large language models with evolutionary algorithms for prompt engineering

    Q Guo, R Wang, J Wang, B Li, K He, X Tan, J Bian, and Y Zheng. Evoprompt: Connecting large language models with evolutionary algorithms for prompt engineering. In Proceedings of the International Conference on Learning Representations (ICLR), Vienna, Austria, pages 7–11, 2024

  8. [8]

    DSPy: Compiling Declarative Language Model Calls into Self-Improving Pipelines

    Omar Khattab, Arnav Singhvi, Paridhi Maheshwari, Zhiyuan Zhang, Keshav Santhanam, Sri Vardhamanan, Saiful Haq, Ashutosh Sharma, Thomas T Joshi, Hanna Moazam, et al. Dspy: Compiling declarative language model calls into self-improving pipelines. arXiv preprint arXiv:2310.03714, 2023

  9. [9]

    Submodular evaluation subset selection in automatic prompt optimization

    Jinming Nian, Zhiyuan Peng, Hongwei Shang, Dae Hoon Park, and Yi Fang. Submodular evaluation subset selection in automatic prompt optimization. arXiv preprint arXiv:2601.03493, 2026

  10. [10]

    Model performance-guided evaluation data selection for effective prompt optimization

    Ximing Dong, Shaowei Wang, Dayi Lin, and Ahmed Hassan. Model performance-guided evaluation data selection for effective prompt optimization. In Findings of the Association for Computational Linguistics: ACL 2025, pages 2844–2859, 2025

  11. [11]

    Grips: Gradient-free, edit-based instruction search for prompting large language models

    Archiki Prasad, Peter Hase, Xiang Zhou, and Mohit Bansal. Grips: Gradient-free, edit-based instruction search for prompting large language models. In Proceedings of the 17th Conference of the European Chapter of the Association for Computational Linguistics , pages 3845–3864, 2023

  12. [12]

    TextGrad: Automatic "Differentiation" via Text

    Mert Yuksekgonul, Federico Bianchi, Joseph Boen, Sheng Liu, Zhi Huang, Carlos Guestrin, and James Zou. TextGrad: Automatic "Differentiation" via Text. arXiv preprint arXiv:2406.07496, 2024

  13. [13]

    Promptbreeder: Self-referential self-improvement via prompt evolution, 2023

    Chrisantha Fernando, Dylan Banarse, Henryk Michalewski, Simon Osindero, and Tim Rocktäschel. Promptbreeder: Self-referential self-improvement via prompt evolution. arXiv preprint arXiv:2309.16797, 2023

  14. [14]

    Instructzero: Efficient instruction optimization for black-box large language models

    Lichang Chen, Jiuhai Chen, Tom Goldstein, Heng Huang, and Tianyi Zhou. Instructzero: Efficient instruction optimization for black-box large language models. arXiv preprint arXiv:2306.03082, 2023

  15. [15]

    Promptagent: Strategic planning with language models enables expert-level prompt optimization

    Xinyuan Wang, Chenxi Li, Zhen Wang, Fan Bai, Haotian Luo, Jiayou Zhang, Nebojsa Jojic, Eric P Xing, and Zhiting Hu. Promptagent: Strategic planning with language models enables expert-level prompt optimization. arXiv preprint arXiv:2310.16427, 2023

  16. [16]

    Prompt optimization with ease? efficient ordering-aware automated selection of exemplars

    Zhaoxuan Wu, Xiaoqiang Lin, Zhongxiang Dai, Wenyang Hu, Yao Shu, See-Kiong Ng, Patrick Jaillet, and Bryan Kian Hsiang Low. Prompt optimization with ease? efficient ordering-aware automated selection of exemplars. Advances in Neural Information Processing Systems, 37:122706–122740, 2024

  17. [17]

    Efficient prompt optimization through the lens of best arm identification

    Chengshuai Shi, Kun Yang, Zihan Chen, Jundong Li, Jing Yang, and Cong Shen. Efficient prompt optimization through the lens of best arm identification. Advances in Neural Information Processing Systems, 37:99646–99685, 2024

  18. [18]

    Adaptive prompt structure factorization: A framework for self-discovering and optimizing compositional prompt programs, 2026

    Haoyue Liu, Zhichao Wang, Yongxin Guo, Haoran Shou, and Xiaoying Tang. Adaptive prompt structure factorization: A framework for self-discovering and optimizing compositional prompt programs, 2026

  19. [19]

    Applications of item response theory to practical testing problems

    Frederic M Lord. Applications of item response theory to practical testing problems. Routledge, 2012

  20. [20]

    Computerized adaptive testing: Theory and practice, volume 13

    Wim J Van der Linden, Cees AW Glas, et al. Computerized adaptive testing: Theory and practice, volume 13. Springer, 2000

  21. [21]

    tinyBenchmarks : evaluating LLMs with fewer examples

    Felipe Maia Polo, Lucas Weber, Leshem Choshen, Yuekai Sun, Gongjun Xu, and Mikhail Yurochkin. tinyBenchmarks: evaluating LLMs with fewer examples. arXiv preprint arXiv:2402.14992, 2024

  22. [22]

    metabench – a sparse benchmark of reasoning and knowledge in large language models

    Alex Kipnis, Konstantinos Voudouris, Luca M Schulze Buschoff, and Eric Schulz. metabench – a sparse benchmark of reasoning and knowledge in large language models. arXiv preprint arXiv:2407.12844, 2024

  23. [23]

    Item response theory in ai: Analysing machine learning classifiers at the instance level

    Fernando Martínez-Plumed, Ricardo BC Prudêncio, Adolfo Martínez-Usó, and José Hernández-Orallo. Item response theory in ai: Analysing machine learning classifiers at the instance level. Artificial intelligence, 271:18–42, 2019

  24. [24]

    Position: AI evaluation should learn from how we test humans

    Yan Zhuang, Qi Liu, Zachary Pardos, Patrick C Kyllonen, Jiyun Zu, Zhenya Huang, Shijin Wang, and Enhong Chen. Position: AI evaluation should learn from how we test humans. In Forty-second International Conference on Machine Learning Position Paper Track, 2025

  25. [25]

    An analysis of approximations for maximizing submodular set functions I

    George L Nemhauser, Laurence A Wolsey, and Marshall L Fisher. An analysis of approximations for maximizing submodular set functions I. Mathematical Programming, 14(1):265–294, 1978

  26. [26]

    Submodular function maximization

    Andreas Krause and Daniel Golovin. Submodular function maximization. Tractability, 3(71-104):3, 2014

  27. [27]

    Lazier than lazy greedy

    Baharan Mirzasoleiman, Ashwinkumar Badanidiyuru, Amin Karbasi, Jan Vondrák, and Andreas Krause. Lazier than lazy greedy. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 29, 2015

  28. [28]

    An online algorithm for maximizing submodular functions

    Matthew Streeter and Daniel Golovin. An online algorithm for maximizing submodular functions. Advances in Neural Information Processing Systems, 21, 2008

  29. [29]

    Online submodular maximization under a matroid constraint with application to learning assignments

    Daniel Golovin, Andreas Krause, and Matthew Streeter. Online submodular maximization under a matroid constraint with application to learning assignments. arXiv preprint arXiv:1407.1082, 2014

  30. [30]

    Information complexity in bandit subset selection

    Emilie Kaufmann and Shivaram Kalyanakrishnan. Information complexity in bandit subset selection. In Conference on Learning Theory, pages 228–251. PMLR, 2013

  31. [31]

    Bandits with switching costs: T^{2/3} regret

    Ofer Dekel, Jian Ding, Tomer Koren, and Yuval Peres. Bandits with switching costs: T^{2/3} regret. In Proceedings of the forty-sixth annual ACM Symposium on Theory of Computing, pages 459–467, 2014

  32. [32]

    A 2-competitive algorithm for online convex optimization with switching costs

    Nikhil Bansal, Anupam Gupta, Ravishankar Krishnaswamy, Kirk Pruhs, Kevin Schewior, and Cliff Stein. A 2-competitive algorithm for online convex optimization with switching costs. In Approximation, Randomization, and Combinatorial Optimization. Algorithms and Techniques (APPROX/RANDOM 2015), pages 96–109. Schloss Dagstuhl–Leibniz-Zentrum für Informatik, 2015

  33. [33]

    Coresets for data-efficient training of machine learning models

    Baharan Mirzasoleiman, Jeff Bilmes, and Jure Leskovec. Coresets for data-efficient training of machine learning models. In International Conference on Machine Learning, pages 6950–

  34. [34]

    D2 pruning: Message passing for balancing diversity and difficulty in data pruning

    Adyasha Maharana, Prateek Yadav, and Mohit Bansal. D2 pruning: Message passing for balancing diversity and difficulty in data pruning. arXiv preprint arXiv:2310.07931, 2023

  35. [35]

    Curriculum learning

    Yoshua Bengio, Jérôme Louradour, Ronan Collobert, and Jason Weston. Curriculum learning. In Proceedings of the 26th Annual International Conference on Machine Learning, ICML ’09, pages 41–48, New York, NY, USA, 2009. Association for Computing Machinery

  36. [36]

    Active Learning for Convolutional Neural Networks: A Core-Set Approach

    Ozan Sener and Silvio Savarese. Active learning for convolutional neural networks: A core-set approach. arXiv preprint arXiv:1708.00489, 2017

  37. [37]

    Deep batch active learning by diverse, uncertain gradient lower bounds

    Jordan T Ash, Chicheng Zhang, Akshay Krishnamurthy, John Langford, and Alekh Agarwal. Deep batch active learning by diverse, uncertain gradient lower bounds. arXiv preprint arXiv:1906.03671, 2019

  38. [38]

    Glister: Generalization based data subset selection for efficient and robust learning

    Krishnateja Killamsetty, Durga Sivasubramanian, Ganesh Ramakrishnan, and Rishabh Iyer. Glister: Generalization based data subset selection for efficient and robust learning. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 35, pages 8110–8118, 2021

  39. [39]

    Selection via proxy: Efficient data selection for deep learning

    Cody Coleman, Christopher Yeh, Stephen Mussmann, Baharan Mirzasoleiman, Peter Bailis, Percy Liang, Jure Leskovec, and Matei Zaharia. Selection via proxy: Efficient data selection for deep learning. arXiv preprint arXiv:1906.11829, 2019

  40. [40]

    Deep learning on a data diet: Finding important examples early in training

    Mansheej Paul, Surya Ganguli, and Gintare Karolina Dziugaite. Deep learning on a data diet: Finding important examples early in training. Advances in neural information processing systems, 34:20596–20607, 2021

  41. [41]

    Monotone submodular maximization over a matroid via non-oblivious local search

    Yuval Filmus and Justin Ward. Monotone submodular maximization over a matroid via non-oblivious local search. SIAM Journal on Computing, 43(2):514–542, 2014

  42. [42]

    Maximizing non-monotone submodular functions

    Uriel Feige, Vahab S Mirrokni, and Jan Vondrák. Maximizing non-monotone submodular functions. SIAM Journal on Computing, 40(4):1133–1153, 2011

  43. [43]

    Challenging big-bench tasks and whether chain-of-thought can solve them

    Mirac Suzgun, Nathan Scales, Nathanael Schärli, Sebastian Gehrmann, Yi Tay, Hyung Won Chung, Aakanksha Chowdhery, Quoc Le, Ed Chi, Denny Zhou, et al. Challenging big-bench tasks and whether chain-of-thought can solve them. In Findings of the Association for Computational Linguistics: ACL 2023, pages 13003–13051, 2023

  44. [44]

    Beyond the imitation game: Quantifying and extrapolating the capabilities of language models

    Aarohi Srivastava, Abhinav Rastogi, Abhishek Rao, Abu Awal Md Shoeb, Abubakar Abid, Adam Fisch, Adam R Brown, Adam Santoro, Aditya Gupta, Adrià Garriga-Alonso, et al. Beyond the imitation game: Quantifying and extrapolating the capabilities of language models. Transactions on Machine Learning Research, 2023

  45. [45]

    Measuring Massive Multitask Language Understanding

    Dan Hendrycks, Collin Burns, Steven Basart, Andy Zou, Mantas Mazeika, Dawn Song, and Jacob Steinhardt. Measuring massive multitask language understanding. arXiv preprint arXiv:2009.03300, 2020

  46. [46]

    Holistic evaluation of language models

    Rishi Bommasani, Percy Liang, and Tony Lee. Holistic evaluation of language models. Annals of the New York Academy of Sciences, 1525(1):140–146, 2023

  47. [47]

    Noise reduction: A well-chosen subset can filter out noisy or uninformative examples that dilute the evaluation signal

  48. [48]

    Focus effect: By concentrating evaluation on discriminative examples, the optimizer receives sharper feedback about which prompts are truly better

  49. [49]

    calculate the final coordinates

    Budget reallocation: The cost savings from subset evaluation can be reinvested into more optimization steps or more prompt candidates per step. This phenomenon is analogous to the data pruning literature [34], where training on a carefully selected subset can match or exceed full-data training. I Broader Impact This work addresses evaluation subset sch...

  50. [50]

    During the initial rounds, POES uses a random subset identical to the Random baseline

    Warmup provides a stable foundation. During the initial rounds, POES uses a random subset identical to the Random baseline. This is deliberate: the IRT model requires a minimum number … Table 11: Qualitative evolution of the POES evaluation subset on BBH Navigate (seed 44). During warmup, the subset is random; after warmup exit, it is actively refined via...

  51. [51]

    Transition to active scheduling is data-driven. The warmup-to-active transition is triggered when the discrimination ratio exceeds the exit threshold ρexit, indicating that at least some ex- amples have become meaningfully more informative than the average. On BBH Navigate, this typically occurs at round 2–3

  52. [52]

    After warmup exit, the subset evolves gradually: the swap budget Bt limits the number of items that can change per round (typically 2–4 out of k=20)

    Bounded swaps ensure stability. After warmup exit, the subset evolves gradually: the swap budget Bt limits the number of items that can change per round (typically 2–4 out of k=20). This prevents the erratic subset changes observed with IPOMP, which can replace up to 100% of the subset in a single round

  53. [53]

    discriminative

    Contrast with static methods. Random and SESS both use a fixed subset from round 1 through the final round. While SESS’s subset is more principled (selected via submodular optimization over embedding diversity), it cannot adapt to the changing prompt population. As optimization progresses and top prompts converge, the discriminative examples shift—but stati...

  54. [54]

    Claims Question: Do the main claims made in the abstract and introduction accurately reflect the paper’s contributions and scope? Answer: [Yes] Justification: The abstract and introduction clearly state our three contributions (formulation, algorithm, experiments) and the experimental claims are supported by results in Section 4

  55. [55]

    2PL trade-off, scaling to larger pools and generation tasks, and reduced gains when all prompts already perform similarly

    Limitations Question: Does the paper discuss the limitations of the work performed by the authors? Answer: [Yes] Justification: Section 6 discusses four specific limitations: lack of end-to-end APO convergence guarantees, the 1PL model simplicity vs. 2PL trade-off, scaling to larger pools and generation tasks, and reduced gains when all prompts already p...

  56. [56]

    Theory Assumptions and Proofs Question: For each theoretical result, does the paper provide the full set of assumptions and a complete (or correct) proof? Answer: [Yes] Justification: All four propositions are formally stated in Section 3.6 with complete proofs in Section B

  57. [57]

    Experimental Result Reproducibility Question: Does the paper fully disclose all the information needed to reproduce the main experimental results of the paper to the extent that it affects the main claims and/or conclusions of the paper? Answer: [Yes] Justification: Section C provides complete hyperparameter configurations, Section D describes all data...

  58. [58]

    All datasets used are publicly available benchmarks (BBH, BigBench, MMLU, GSM8K, MATH, MultiArith)

    Open access to data and code Question: Does the paper provide open access to the data and code, with sufficient instructions to faithfully reproduce the main experimental results? Answer: [Yes] Justification: Code and data will be released upon acceptance. All datasets used are publicly available benchmarks (BBH, BigBench, MMLU, GSM8K, MATH, MultiArith)

  59. [59]

    Experimental Setting/Details Question: Does the paper specify all the training and test details necessary to understand the results? Answer: [Yes] Justification: Section 4 describes the experimental setup, Section C provides all hyperparameters (Table 4), and Section D details all benchmark configurations

  60. [60]

    The scheduler diagnostics in Section F report means with standard deviations

    Experiment Statistical Significance Question: Does the paper report error bars suitably and correctly? Answer: [Yes] Justification: All experiments are run with multiple random seeds and main-table results report cross-seed averages. The scheduler diagnostics in Section F report means with standard deviations

  61. [61]

    Experiments Compute Resources Question: For each experiment, does the paper provide sufficient information on the computer resources needed to reproduce the experiments? Answer: [Yes] Justification: Section C specifies GPU types (NVIDIA A100-80GB), model serving details (vLLM), and Table 3 reports token consumption and wall-clock time for all methods

  62. [62]

    Code Of Ethics Question: Does the research conducted in the paper conform, in every respect, with the NeurIPS Code of Ethics? Answer: [Yes] Justification: This work focuses on evaluation scheduling for prompt optimization and does not involve human subjects, deception, or harmful applications

  63. [63]

    Broader Impacts Question: Does the paper discuss both potential positive societal impacts and negative societal impacts of the work performed? Answer: [Yes] Justification: Section I discusses both positive impacts (reduced computational cost/carbon footprint) and potential risks (lowering barriers to adversarial prompt engineering) with appropriate mitigations

  64. [64]

    Safeguards Question: Does the paper describe safeguards that have been put in place for responsible release of data or models with a high risk for misuse? Answer: [NA] Justification: This work releases a scheduling algorithm, not a trained model or dataset with misuse risk

  65. [65]

    All are publicly available under permissive licenses

    Licenses for existing assets Question: Are the creators or original owners of assets used in the paper properly credited and are the license and terms of use explicitly mentioned and properly respected? Answer: [Yes] Justification: All benchmarks (BBH, BigBench, MMLU, GSM8K, MATH, MultiArith) and models (Llama-3.1-8B) are properly cited. All are publicly available under permissive licenses

  66. [66]

    New Assets Question: Are new assets introduced in the paper well documented and is the documentation provided alongside the assets? Answer: [Yes] Justification: Our code release will include documentation, configuration files, and instructions for reproducing all experiments

  67. [67]

    Crowdsourcing and Research with Human Subjects Question: For crowdsourcing experiments and research with human subjects, does the paper include the full text of instructions given to participants? Answer: [NA] Justification: This work does not involve crowdsourcing or human subjects

  68. [68]

    Institutional Review Board (IRB) Approvals or Equivalent for Research with Human Subjects Question: Does the paper describe potential risks incurred by study participants? Answer: [NA] Justification: This work does not involve human subjects research

  69. [69]

    Declaration of LLM usage Question: Does the paper describe the usage of LLMs in the core methodology? Answer: [Yes] Justification: Section 3 and Section 4 fully describe the use of LLMs (Llama-3.1-8B as worker, GPT-OSS-120B as meta-optimizer) including model configurations and API details.