pith. machine review for the scientific record.

arxiv: 2604.11328 · v1 · submitted 2026-04-13 · 💻 cs.AI · cs.LG

Recognition: unknown

Select Smarter, Not More: Prompt-Aware Evaluation Scheduling with Submodular Guarantees

Authors on Pith: no claims yet

Pith reviewed 2026-05-10 15:42 UTC · model grok-4.3

classification 💻 cs.AI cs.LG
keywords prompt optimization · evaluation scheduling · submodular optimization · item response theory · automatic prompt engineering · token efficiency · adaptive selection

The pith

POES selects evaluation examples to discriminate strong prompt candidates, yielding higher accuracy with substantially lower token costs than random or fixed subsets.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

Automatic prompt optimization requires evaluating many prompt variants, but scoring each against the full training set quickly exhausts token budgets. The paper maps this to an online adaptive testing scenario where training examples act as test items chosen to separate the best prompts from the rest. It constructs a selection objective that adds an item-response-theory discrimination term, a facility-location coverage term, and switching-cost penalties for reusing prior selections; the resulting function is monotone submodular and therefore admits a greedy algorithm with a (1 − 1/e) approximation guarantee. An adaptive controller further tunes the balance between trying new items and exploiting known good ones as optimization advances. If the claim holds, practitioners can achieve higher final accuracy while spending far fewer tokens per evaluation round, effectively allowing more prompt candidates to be tested within the same compute envelope.
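The (1 − 1/e) figure is the classical guarantee for greedy maximization of a monotone submodular function (Nemhauser, Wolsey, and Fisher, 1978). A minimal sketch of that greedy rule, with a toy facility-location utility standing in for the paper's objective (the item names and similarity scores are invented for illustration, not taken from POES):

```python
def greedy_select(items, utility, k):
    """Pick k items by largest marginal gain. For a monotone submodular
    `utility`, the selected set is within (1 - 1/e) of the optimal value."""
    selected = []
    for _ in range(k):
        best = max((i for i in items if i not in selected),
                   key=lambda i: utility(selected + [i]) - utility(selected))
        selected.append(best)
    return selected

# Toy similarity between data points (a, b, c) and candidate items (x, y, z).
sim = {
    ("a", "x"): 0.9, ("a", "y"): 0.1, ("a", "z"): 0.2,
    ("b", "x"): 0.2, ("b", "y"): 0.8, ("b", "z"): 0.3,
    ("c", "x"): 0.3, ("c", "y"): 0.2, ("c", "z"): 0.7,
}

def coverage(subset):
    # Facility location: each data point is credited with its best
    # representative in the chosen subset.
    if not subset:
        return 0.0
    return sum(max(sim[(p, s)] for s in subset) for p in ("a", "b", "c"))

print(greedy_select(["x", "y", "z"], coverage, 2))  # → ['x', 'y']
```

Each round costs one utility evaluation per remaining item, which is why the paper's per-round swap budget and warm starts matter at scale.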

Core claim

POES frames automatic prompt optimization as the problem of adaptively selecting the training examples that most effectively discriminate among candidate prompts. The method combines three components into one objective proven to be monotone submodular: an IRT discrimination utility that prioritizes items good at separating strong from weak prompts, a facility-location term that ensures broad coverage of the example space, and warm-start swaps that limit switching costs. This property supplies a (1 − 1/e) guarantee for the greedy selector at cold starts and bounded performance drift under warm-start updates. An adaptive controller then modulates exploration versus exploitation according to how far the optimization has progressed.

What carries the argument

The unified submodular objective in POES, formed by summing an IRT-based discrimination utility, a facility-location coverage function, and switching-cost-aware warm-start terms, which enables greedy selection with formal guarantees while adapting to optimization progress.
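As a rough illustration of how such a sum can stay monotone submodular, here is a hypothetical stand-in: a modular 1PL-style discrimination term plus a facility-location coverage term, combined with non-negative weights. The probabilities and similarities below are invented; the paper's actual terms, including the switching-cost component, are defined in its Section 3.

```python
def irt_info(p):
    # 1PL-style item information p(1 - p): largest for items the current
    # prompt population answers correctly about half the time.
    return p * (1.0 - p)

def objective(subset, success_prob, sim, points, w_disc=1.0, w_cov=1.0):
    """Non-negative weighted sum of a modular discrimination term and a
    facility-location coverage term; both summands are monotone submodular,
    so the sum is too, and greedy selection keeps its guarantee."""
    disc = sum(irt_info(success_prob[i]) for i in subset)
    cov = sum(max((sim[(p, s)] for s in subset), default=0.0) for p in points)
    return w_disc * disc + w_cov * cov

# Invented toy data: two data points a/b, three candidate items x/y/z.
success_prob = {"x": 0.5, "y": 0.9, "z": 0.6}
sim = {("a", "x"): 0.9, ("a", "y"): 0.1, ("a", "z"): 0.3,
       ("b", "x"): 0.2, ("b", "y"): 0.8, ("b", "z"): 0.4}
points = ["a", "b"]

f = lambda S: objective(S, success_prob, sim, points)
# Diminishing returns: adding "y" helps less once "x" is already chosen.
gain_empty = f(["y"]) - f([])
gain_after_x = f(["x", "y"]) - f(["x"])
assert gain_after_x <= gain_empty
```

Note that a switching-cost term subtracted naively could break monotonicity; the paper's claim is precisely that its formulation avoids this.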

If this is right

  • At any fixed evaluation budget the scheduler returns higher downstream prompt accuracy than fixed or heuristic baselines.
  • Reducing the evaluation set from 30-50 to 20 examples via principled selection preserves or improves performance, cutting token consumption by 35-60 percent.
  • The submodular guarantee allows the scheduler to be deployed without manual tuning of subset sizes.
  • Evaluation scheduling can be treated as an explicit, optimizable stage in prompt optimization pipelines rather than an afterthought.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The discrimination-plus-coverage logic could transfer to other settings where each iteration requires costly scoring against a large data pool, such as active learning loops or iterative model selection.
  • If submodularity survives more aggressive adaptive policies, it would support online selection algorithms that react to prompt-performance signals in real time without sacrificing approximation bounds.
  • Token savings of this magnitude could be reinvested to enlarge the search space of prompt candidates or to run longer optimization trajectories on the same hardware budget.

Load-bearing premise

The objective remains monotone submodular after the discrimination, coverage, and cost terms are combined and after the adaptive controller adjusts the exploration-exploitation balance.
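The combination step leans on a standard closure property, stated here for reference rather than reproduced from the paper (which must additionally show that each term, the switching-cost one in particular, qualifies individually):

```latex
% If f_1, \dots, f_m are monotone submodular on ground set V and
% w_1, \dots, w_m \ge 0, then F = \sum_i w_i f_i is monotone submodular:
% for all S \subseteq T \subseteq V and e \in V \setminus T,
F(S \cup \{e\}) - F(S)
  = \sum_i w_i \bigl( f_i(S \cup \{e\}) - f_i(S) \bigr)
  \;\ge\; \sum_i w_i \bigl( f_i(T \cup \{e\}) - f_i(T) \bigr)
  = F(T \cup \{e\}) - F(T).
```

The delicate part is therefore not the weighted combination but the individual terms and the adaptive reweighting, which is where the premise could fail.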

What would settle it

Compare the final prompt accuracy obtained when using POES-selected subsets against accuracy obtained when using randomly selected subsets of identical size on a new task; absence of a consistent advantage would falsify the benefit of the submodular scheduling approach.
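One way to operationalize that comparison is a paired test across seeds at matched subset size. The per-seed accuracies below are invented placeholders; a real test would substitute measured values from the new task.

```python
from statistics import mean, stdev

# Hypothetical per-seed final accuracies at an identical subset size k.
poes_acc = [0.82, 0.80, 0.83, 0.81, 0.84]
rand_acc = [0.76, 0.78, 0.75, 0.79, 0.77]

diffs = [p - r for p, r in zip(poes_acc, rand_acc)]
d_mean = mean(diffs)
# Paired t statistic: mean per-seed difference over its standard error.
t = d_mean / (stdev(diffs) / len(diffs) ** 0.5)
print(f"mean gain {d_mean:.3f}, paired t = {t:.2f}")  # → mean gain 0.050, paired t = 3.95
```

A consistently positive, significant paired difference supports the scheduler; a statistic near zero on a new task would count against it.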

Figures

Figures reproduced from arXiv: 2604.11328 by Haoyue Liu, Xiaoying Tang, Xiaoyu Ma, Ye Chen, Yiwen Li, Yongxin Guo, Zhichao Wang.

Figure 1: (a) Optimization curves on BBH Navigate: static baselines plateau while POES (red) continues improving via prompt-aware subset adaptation, achieving +8.3% over the best baseline. (b) Accuracy vs. token consumption: POES at k=20 dominates all baselines (high accuracy, moderate cost); at k=10 it stays close to baseline accuracy (0.804 vs. 0.820) with 34% fewer tokens.
Figure 2: Overview of the POES framework. The scheduler (dashed box) integrates five components …
Figure 3: Per-task accuracy comparison on rank-1 tasks (OPRO …
Figure 4: Analysis and ablation. (a) POES outperforms the baseline average on 86% of tasks (30W/1T/5L). (b) Improvement correlates negatively with baseline accuracy (r=−0.40, p=0.016): gains are largest on harder tasks. (c) Waterfall decomposes the 12.8pp ablation gain: adaptive (+6.5pp), warmup (+5.5pp), coverage (+0.8pp). (d) Warm-start matches or exceeds cold-start on all 4 tasks (+9% BBH Navigate, +11% BB Naviga…
Figure 5: Hyperparameter sensitivity. (a) Test accuracy across 8 benchmarks (faded lines) and their average (dark, ±1 SEM) as a function of candidate pool size k; performance plateaus beyond k=20. (b) Optimization token cost scales linearly with k; the shaded region marks diminishing returns. (c) IRT model comparison: the 1PL model (blue) outperforms the 2PL model (red) on average (+2.3pp), with the largest gap on D…
Figure 6: Optimization curves on 8 representative tasks. POES (red) reaches the highest final score …
Figure 7: Ablation optimization curves (best seed) on three tasks. POES (red) reaches the highest or …
Figure 8: Per-task rank distribution (violin plot) for each scheduling method. Lower rank is better.
Original abstract

Automatic prompt optimization (APO) hinges on the quality of its evaluation signal, yet scoring every prompt candidate on the full training set is prohibitively expensive. Existing methods either fix a single evaluation subset before optimization begins (principled but prompt-agnostic) or adapt it heuristically during optimization (flexible but unstable and lacking formal guarantees). We observe that APO naturally maps to an online adaptive testing problem: prompts are examinees, training examples are test items, and the scheduler should select items that best discriminate among the strongest candidates. This insight motivates Prompt-Aware Online Evaluation Scheduling (POES), which integrates an IRT-based discrimination utility, a facility-location coverage term, and switching-cost-aware warm-start swaps into a unified objective that is provably monotone submodular, yielding a (1-1/e) greedy guarantee for cold starts and bounded drift for warm-start updates. An adaptive controller modulates the exploration-exploitation balance based on optimization progress. Across 36 tasks spanning three benchmark families, POES achieves the highest overall average accuracy (6.2 percent improvement over the best baseline) with negligible token overhead (approximately 4 percent) at the same evaluation budget. Moreover, principled selection at k = 20 examples matches or exceeds the performance of naive evaluation at k = 30-50, reducing token consumption by 35-60 percent, showing that selecting smarter is more effective than selecting more. Our results demonstrate that evaluation scheduling is a first-class component of APO, not an implementation detail.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 1 minor

Summary. The paper claims that by framing automatic prompt optimization (APO) as an online adaptive testing problem, one can design Prompt-Aware Online Evaluation Scheduling (POES) using a composite objective of IRT discrimination, facility location coverage, and switching costs that is monotone submodular. This yields a (1-1/e) greedy guarantee and allows an adaptive controller. Experiments across 36 tasks show POES achieves 6.2% higher average accuracy than the best baseline with ~4% token overhead, and that k=20 principled selection outperforms naive k=30-50, saving tokens.

Significance. Should the submodularity property be established, this provides a principled, guaranteed-efficient method for evaluation in APO, which is a key bottleneck. The empirical demonstration of performance gains and token reduction at fixed budget underscores the value of smart scheduling over simply using more examples. It positions evaluation scheduling as central to APO rather than an afterthought.

major comments (2)
  1. [Abstract] The assertion that the unified objective is 'provably monotone submodular' yielding the (1-1/e) guarantee is made without any proof sketch, derivation, or verification of submodularity preservation after combining terms and under adaptive modulation. This is load-bearing for the theoretical justification of POES over heuristics.
  2. [Results section] The reported 6.2% average accuracy improvement and token savings lack details on statistical controls, number of runs, variance, or precise baseline implementations, making it difficult to assess the robustness of the empirical claims.
minor comments (1)
  1. Clarify the specific benchmark families and tasks used in the 36-task evaluation for reproducibility.

Simulated Authors' Rebuttal

2 responses · 0 unresolved

We thank the referee for their constructive feedback, which helps strengthen the presentation of both the theoretical guarantees and empirical results for POES. We address each major comment point by point below and will revise the manuscript accordingly.

Point-by-point responses
  1. Referee: [Abstract] The assertion that the unified objective is 'provably monotone submodular' yielding the (1-1/e) guarantee is made without any proof sketch, derivation, or verification of submodularity preservation after combining terms and under adaptive modulation. This is load-bearing for the theoretical justification of POES over heuristics.

    Authors: The full manuscript (Section 3) establishes monotonicity and submodularity separately for the IRT discrimination utility, the facility-location coverage term, and the switching-cost penalty; it then proves that their non-negative linear combination remains monotone submodular and that the adaptive controller induces only bounded drift, preserving the (1-1/e) greedy guarantee for cold-start selection. Because the abstract is space-constrained, we omitted an explicit sketch there. In the revision we will insert a concise two-sentence proof outline immediately after the claim in the abstract and add a pointer to the full derivation in Section 3. revision: yes

  2. Referee: [Results section] The reported 6.2% average accuracy improvement and token savings lack details on statistical controls, number of runs, variance, or precise baseline implementations, making it difficult to assess the robustness of the empirical claims.

    Authors: We agree that additional statistical detail is warranted. The 6.2% figure is the mean improvement across 36 tasks, each evaluated with 5 independent random seeds; standard deviations and 95% confidence intervals will be reported in the revised results tables. We will also expand the experimental-setup subsection to specify exact baseline configurations (including prompt-selection heuristics, evaluation budgets, and hyper-parameters) and to confirm that all methods were run under identical token budgets and model checkpoints. revision: yes

Circularity Check

0 steps flagged

No significant circularity in derivation chain

Full rationale

The paper constructs the POES objective by combining standard external components (IRT discrimination utility, facility-location coverage, switching-cost warm-start swaps) into a unified function asserted to be monotone submodular, yielding the (1-1/e) greedy guarantee. This is not self-definitional, as the submodularity is claimed to follow from the properties of the combined terms rather than being defined in terms of the target APO accuracy or fitted parameters. No load-bearing self-citations, ansatz smuggling, or renaming of known results appear in the abstract or described chain. Experiments across 36 tasks supply independent empirical support. The adaptive modulation is stated to preserve the property without reducing the central claim to a tautology or input fit.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Abstract supplies no explicit free parameters, axioms, or invented entities; all modeling assumptions are implicit in the IRT and submodularity claims.

pith-pipeline@v0.9.0 · 5595 in / 1238 out tokens · 37132 ms · 2026-05-10T15:42:59.960255+00:00 · methodology

discussion (0)


Reference graph

Works this paper leans on

69 extracted references · 16 canonical work pages · 3 internal anchors

  1. [1]

    Large language models are human-level prompt engineers, 2023

    Y Zhou, AI Muresanu, Z Han, K Paster, S Pitis, H Chan, and J Ba. Large language models are human-level prompt engineers (arxiv: 2211.01910). arxiv, 2023

  2. [2]

    Large language models as optimizers

    Chengrun Yang, Xuezhi Wang, Yifeng Lu, Hanxiao Liu, Quoc V Le, Denny Zhou, and Xinyun Chen. Large language models as optimizers. In The Twelfth International Conference on Learning Representations, 2023

  3. [3]

    Quantifying language models' sensitivity to spurious features in prompt design

    Melanie Sclar, Yejin Choi, Yulia Tsvetkov, and Alane Suhr. Quantifying language models' sensitivity to spurious features in prompt design or: How I learned to start worrying about prompt formatting. arXiv preprint arXiv:2310.11324, 2023

  4. [4]

    Language models are few-shot learners

    Tom Brown, Benjamin Mann, Nick Ryder, Melanie Subbiah, Jared D Kaplan, Prafulla Dhariwal, Arvind Neelakantan, Pranav Shyam, Girish Sastry, Amanda Askell, et al. Language models are few-shot learners. Advances in Neural Information Processing Systems, 33:1877–1901, 2020

  5. [5]

    Chain-of-thought prompting elicits reasoning in large language models

    Jason Wei, Xuezhi Wang, Dale Schuurmans, Maarten Bosma, Fei Xia, Ed Chi, Quoc V Le, Denny Zhou, et al. Chain-of-thought prompting elicits reasoning in large language models. Advances in Neural Information Processing Systems, 35:24824–24837, 2022

  6. [6]

    Automatic prompt optimization with gradient descent and beam search

    Reid Pryzant, Dan Iter, Jerry Li, Yin Lee, Chenguang Zhu, and Michael Zeng. Automatic prompt optimization with gradient descent and beam search. In Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing, pages 7957–7968, 2023

  7. [7]

    Evoprompt: Connecting large language models with evolutionary algorithms for prompt engineering

    Q Guo, R Wang, J Wang, B Li, K He, X Tan, J Bian, and Y Zheng. Evoprompt: Connecting large language models with evolutionary algorithms for prompt engineering. In Proceedings of the International Conference on Learning Representations (ICLR), Vienna, Austria, pages 7–11, 2024

  8. [8]

    DSPy: Compiling Declarative Language Model Calls into Self-Improving Pipelines

    Omar Khattab, Arnav Singhvi, Paridhi Maheshwari, Zhiyuan Zhang, Keshav Santhanam, Sri Vardhamanan, Saiful Haq, Ashutosh Sharma, Thomas T Joshi, Hanna Moazam, et al. Dspy: Compiling declarative language model calls into self-improving pipelines. arXiv preprint arXiv:2310.03714, 2023

  9. [9]

    Submodular evaluation subset selection in automatic prompt optimization

    Jinming Nian, Zhiyuan Peng, Hongwei Shang, Dae Hoon Park, and Yi Fang. Submodular evaluation subset selection in automatic prompt optimization. arXiv preprint arXiv:2601.03493, 2026

  10. [10]

    Model performance-guided evaluation data selection for effective prompt optimization

    Ximing Dong, Shaowei Wang, Dayi Lin, and Ahmed Hassan. Model performance-guided evaluation data selection for effective prompt optimization. In Findings of the Association for Computational Linguistics: ACL 2025, pages 2844–2859, 2025

  11. [11]

    Grips: Gradient-free, edit-based instruction search for prompting large language models

    Archiki Prasad, Peter Hase, Xiang Zhou, and Mohit Bansal. Grips: Gradient-free, edit-based instruction search for prompting large language models. In Proceedings of the 17th Conference of the European Chapter of the Association for Computational Linguistics , pages 3845–3864, 2023

  12. [12]

    TextGrad: Automatic "Differentiation" via Text

    Mert Yuksekgonul, Federico Bianchi, Joseph Boen, Sheng Liu, Zhi Huang, Carlos Guestrin, and James Zou. TextGrad: Automatic "Differentiation" via Text. arXiv preprint arXiv:2406.07496, 2024

  13. [13]

    Promptbreeder: Self-referential self-improvement via prompt evolution, 2023

    Chrisantha Fernando, Dylan Banarse, Henryk Michalewski, Simon Osindero, and Tim Rocktäschel. Promptbreeder: Self-referential self-improvement via prompt evolution. arXiv preprint arXiv:2309.16797, 2023

  14. [14]

    Instructzero: Efficient instruction optimization for black-box large language models

    Lichang Chen, Jiuhai Chen, Tom Goldstein, Heng Huang, and Tianyi Zhou. Instructzero: Efficient instruction optimization for black-box large language models. arXiv preprint arXiv:2306.03082, 2023

  15. [15]

    Promptagent: Strategic planning with language models enables expert-level prompt optimization

    Xinyuan Wang, Chenxi Li, Zhen Wang, Fan Bai, Haotian Luo, Jiayou Zhang, Nebojsa Jojic, Eric P Xing, and Zhiting Hu. Promptagent: Strategic planning with language models enables expert-level prompt optimization. arXiv preprint arXiv:2310.16427, 2023

  16. [16]

    Prompt optimization with ease? efficient ordering-aware automated selection of exemplars

    Zhaoxuan Wu, Xiaoqiang Lin, Zhongxiang Dai, Wenyang Hu, Yao Shu, See-Kiong Ng, Patrick Jaillet, and Bryan Kian Hsiang Low. Prompt optimization with ease? efficient ordering-aware automated selection of exemplars. Advances in Neural Information Processing Systems, 37:122706–122740, 2024

  17. [17]

    Efficient prompt optimization through the lens of best arm identification

    Chengshuai Shi, Kun Yang, Zihan Chen, Jundong Li, Jing Yang, and Cong Shen. Efficient prompt optimization through the lens of best arm identification. Advances in Neural Information Processing Systems, 37:99646–99685, 2024

  18. [18]

    Adaptive prompt structure factorization: A framework for self-discovering and optimizing compositional prompt programs, 2026

    Haoyue Liu, Zhichao Wang, Yongxin Guo, Haoran Shou, and Xiaoying Tang. Adaptive prompt structure factorization: A framework for self-discovering and optimizing compositional prompt programs, 2026

  19. [19]

    Applications of item response theory to practical testing problems

    Frederic M Lord. Applications of item response theory to practical testing problems. Routledge, 2012

  20. [20]

    Computerized adaptive testing: Theory and practice, volume 13

    Wim J Van der Linden, Cees AW Glas, et al. Computerized adaptive testing: Theory and practice, volume 13. Springer, 2000

  21. [21]

    tinyBenchmarks : evaluating LLMs with fewer examples

    Felipe Maia Polo, Lucas Weber, Leshem Choshen, Yuekai Sun, Gongjun Xu, and Mikhail Yurochkin. tinyBenchmarks: evaluating LLMs with fewer examples. arXiv preprint arXiv:2402.14992, 2024

  22. [22]

    metabench – a sparse benchmark of reasoning and knowledge in large language models

    Alex Kipnis, Konstantinos Voudouris, Luca M Schulze Buschoff, and Eric Schulz. metabench – a sparse benchmark of reasoning and knowledge in large language models. arXiv preprint arXiv:2407.12844, 2024

  23. [23]

    Item response theory in ai: Analysing machine learning classifiers at the instance level

    Fernando Martínez-Plumed, Ricardo BC Prudêncio, Adolfo Martínez-Usó, and José Hernández-Orallo. Item response theory in ai: Analysing machine learning classifiers at the instance level. Artificial intelligence, 271:18–42, 2019

  24. [24]

    Position: AI evaluation should learn from how we test humans

    Yan Zhuang, Qi Liu, Zachary Pardos, Patrick C Kyllonen, Jiyun Zu, Zhenya Huang, Shijin Wang, and Enhong Chen. Position: AI evaluation should learn from how we test humans. In Forty-second International Conference on Machine Learning Position Paper Track, 2025

  25. [25]

    An analysis of approximations for maximizing submodular set functions I

    George L Nemhauser, Laurence A Wolsey, and Marshall L Fisher. An analysis of approximations for maximizing submodular set functions I. Mathematical Programming, 14(1):265–294, 1978

  26. [26]

    Submodular function maximization

    Andreas Krause and Daniel Golovin. Submodular function maximization. Tractability, 3(71-104):3, 2014

  27. [27]

    Lazier than lazy greedy

    Baharan Mirzasoleiman, Ashwinkumar Badanidiyuru, Amin Karbasi, Jan Vondrák, and Andreas Krause. Lazier than lazy greedy. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 29, 2015

  28. [28]

    An online algorithm for maximizing submodular functions

    Matthew Streeter and Daniel Golovin. An online algorithm for maximizing submodular functions. Advances in Neural Information Processing Systems, 21, 2008

  29. [29]

    Online submodular maximization under a matroid constraint with application to learning assignments

    Daniel Golovin, Andreas Krause, and Matthew Streeter. Online submodular maximization under a matroid constraint with application to learning assignments. arXiv preprint arXiv:1407.1082, 2014

  30. [30]

    Information complexity in bandit subset selection

    Emilie Kaufmann and Shivaram Kalyanakrishnan. Information complexity in bandit subset selection. In Conference on Learning Theory, pages 228–251. PMLR, 2013

  31. [31]

    Bandits with switching costs: T^{2/3} regret

    Ofer Dekel, Jian Ding, Tomer Koren, and Yuval Peres. Bandits with switching costs: T^{2/3} regret. In Proceedings of the forty-sixth annual ACM Symposium on Theory of Computing, pages 459–467, 2014

  32. [32]

    A 2-competitive algorithm for online convex optimization with switching costs

    Nikhil Bansal, Anupam Gupta, Ravishankar Krishnaswamy, Kirk Pruhs, Kevin Schewior, and Cliff Stein. A 2-competitive algorithm for online convex optimization with switching costs. In Approximation, Randomization, and Combinatorial Optimization. Algorithms and Techniques (APPROX/RANDOM 2015), pages 96–109. Schloss Dagstuhl–Leibniz-Zentrum für Informatik, 2015

  33. [33]

    Coresets for data-efficient training of machine learning models

    Baharan Mirzasoleiman, Jeff Bilmes, and Jure Leskovec. Coresets for data-efficient training of machine learning models. In International Conference on Machine Learning, pages 6950–

  34. [34]

    D2 pruning: Message passing for balancing diversity and difficulty in data pruning

    Adyasha Maharana, Prateek Yadav, and Mohit Bansal. D2 pruning: Message passing for balancing diversity and difficulty in data pruning. arXiv preprint arXiv:2310.07931, 2023

  35. [35]

    Curriculum learning

    Yoshua Bengio, Jérôme Louradour, Ronan Collobert, and Jason Weston. Curriculum learning. In Proceedings of the 26th Annual International Conference on Machine Learning, ICML ’09, pages 41–48, New York, NY, USA, 2009. Association for Computing Machinery

  36. [36]

    Active Learning for Convolutional Neural Networks: A Core-Set Approach

    Ozan Sener and Silvio Savarese. Active learning for convolutional neural networks: A core-set approach. arXiv preprint arXiv:1708.00489, 2017

  37. [37]

    Deep batch active learning by diverse, uncertain gradient lower bounds

    Jordan T Ash, Chicheng Zhang, Akshay Krishnamurthy, John Langford, and Alekh Agarwal. Deep batch active learning by diverse, uncertain gradient lower bounds. arXiv preprint arXiv:1906.03671, 2019

  38. [38]

    Glister: Generalization based data subset selection for efficient and robust learning

    Krishnateja Killamsetty, Durga Sivasubramanian, Ganesh Ramakrishnan, and Rishabh Iyer. Glister: Generalization based data subset selection for efficient and robust learning. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 35, pages 8110–8118, 2021

  39. [39]

    Selection via proxy: Efficient data selection for deep learning

    Cody Coleman, Christopher Yeh, Stephen Mussmann, Baharan Mirzasoleiman, Peter Bailis, Percy Liang, Jure Leskovec, and Matei Zaharia. Selection via proxy: Efficient data selection for deep learning. arXiv preprint arXiv:1906.11829, 2019

  40. [40]

    Deep learning on a data diet: Finding important examples early in training

    Mansheej Paul, Surya Ganguli, and Gintare Karolina Dziugaite. Deep learning on a data diet: Finding important examples early in training. Advances in neural information processing systems, 34:20596–20607, 2021

  41. [41]

    Monotone submodular maximization over a matroid via non-oblivious local search

    Yuval Filmus and Justin Ward. Monotone submodular maximization over a matroid via non-oblivious local search. SIAM Journal on Computing, 43(2):514–542, 2014

  42. [42]

    Maximizing non-monotone submodular functions

    Uriel Feige, Vahab S Mirrokni, and Jan Vondrák. Maximizing non-monotone submodular functions. SIAM Journal on Computing, 40(4):1133–1153, 2011

  43. [43]

    Challenging big-bench tasks and whether chain-of-thought can solve them

    Mirac Suzgun, Nathan Scales, Nathanael Schärli, Sebastian Gehrmann, Yi Tay, Hyung Won Chung, Aakanksha Chowdhery, Quoc Le, Ed Chi, Denny Zhou, et al. Challenging big-bench tasks and whether chain-of-thought can solve them. In Findings of the Association for Computational Linguistics: ACL 2023, pages 13003–13051, 2023

  44. [44]

    Beyond the imitation game: Quantifying and extrapolating the capabilities of language models

    Aarohi Srivastava, Abhinav Rastogi, Abhishek Rao, Abu Awal Md Shoeb, Abubakar Abid, Adam Fisch, Adam R Brown, Adam Santoro, Aditya Gupta, Adrià Garriga-Alonso, et al. Beyond the imitation game: Quantifying and extrapolating the capabilities of language models. Transactions on Machine Learning Research, 2023

  45. [45]

    Measuring Massive Multitask Language Understanding

    Dan Hendrycks, Collin Burns, Steven Basart, Andy Zou, Mantas Mazeika, Dawn Song, and Jacob Steinhardt. Measuring massive multitask language understanding. arXiv preprint arXiv:2009.03300, 2020

  46. [46]

    Holistic evaluation of language models

    Rishi Bommasani, Percy Liang, and Tony Lee. Holistic evaluation of language models. Annals of the New York Academy of Sciences, 1525(1):140–146, 2023

  47. [47]

    Noise reduction: A well-chosen subset can filter out noisy or uninformative examples that dilute the evaluation signal

  48. [48]

    Focus effect: By concentrating evaluation on discriminative examples, the optimizer receives sharper feedback about which prompts are truly better

  49. [49]

    calculate the final coordinates

    Budget reallocation: The cost savings from subset evaluation can be reinvested into more optimization steps or more prompt candidates per step. This phenomenon is analogous to the data pruning literature [34], where training on a carefully selected subset can match or exceed full-data training. I Broader Impact This work addresses evaluation subset sch...

  50. [50]

    During the initial rounds, POES uses a random subset identical to the Random baseline

    Warmup provides a stable foundation. During the initial rounds, POES uses a random subset identical to the Random baseline. This is deliberate: the IRT model requires a minimum number … Table 11: Qualitative evolution of the POES evaluation subset on BBH Navigate (seed 44). During warmup, the subset is random; after warmup exit, it is actively refined via...

  51. [51]

    Transition to active scheduling is data-driven. The warmup-to-active transition is triggered when the discrimination ratio exceeds the exit threshold ρexit, indicating that at least some ex- amples have become meaningfully more informative than the average. On BBH Navigate, this typically occurs at round 2–3

  52. [52]

    After warmup exit, the subset evolves gradually: the swap budget Bt limits the number of items that can change per round (typically 2–4 out of k=20)

    Bounded swaps ensure stability. After warmup exit, the subset evolves gradually: the swap budget Bt limits the number of items that can change per round (typically 2–4 out of k=20). This prevents the erratic subset changes observed with IPOMP, which can replace up to 100% of the subset in a single round

  53. [53]

    discriminative

    Contrast with static methods. Random and SESS both use a fixed subset from round 1 through the final round. While SESS’s subset is more principled (selected via submodular optimization over embedding diversity), it cannot adapt to the changing prompt population. As optimization progresses and top prompts converge, the discriminative examples shift—but stati...

  54. [54]

    Claims Question: Do the main claims made in the abstract and introduction accurately reflect the paper’s contributions and scope? Answer: [Yes] Justification: The abstract and introduction clearly state our three contributions (formulation, algorithm, experiments) and the experimental claims are supported by results in Section 4

  55. [55]

    2PL trade-off, scaling to larger pools and generation tasks, and reduced gains when all prompts already perform similarly

    Limitations Question: Does the paper discuss the limitations of the work performed by the authors? Answer: [Yes] Justification: Section 6 discusses four specific limitations: lack of end-to-end APO convergence guarantees, the 1PL model simplicity vs. 2PL trade-off, scaling to larger pools and generation tasks, and reduced gains when all prompts already p...

  56. [56]

    Theory Assumptions and Proofs Question: For each theoretical result, does the paper provide the full set of assumptions and a complete (or correct) proof? Answer: [Yes] Justification: All four propositions are formally stated in Section 3.6 with complete proofs in Section B

  57. [57]

    Experimental Result Reproducibility Question: Does the paper fully disclose all the information needed to reproduce the main experimental results of the paper to the extent that it affects the main claims and/or conclusions of the paper? Answer: [Yes] Justification: Section C provides complete hyperparameter configurations, Section D describes all data...

  58. [58]

    All datasets used are publicly available benchmarks (BBH, BigBench, MMLU, GSM8K, MATH, MultiArith)

    Open access to data and code Question: Does the paper provide open access to the data and code, with sufficient instructions to faithfully reproduce the main experimental results? Answer: [Yes] Justification: Code and data will be released upon acceptance. All datasets used are publicly available benchmarks (BBH, BigBench, MMLU, GSM8K, MATH, MultiArith)

  59. [59]

    Experimental Setting/Details Question: Does the paper specify all the training and test details necessary to understand the results? Answer: [Yes] Justification: Section 4 describes the experimental setup, Section C provides all hyperparameters (Table 4), and Section D details all benchmark configurations

  60. [60]

    The scheduler diagnostics in Section F report means with standard deviations

    Experiment Statistical Significance Question: Does the paper report error bars suitably and correctly? Answer: [Yes] Justification: All experiments are run with multiple random seeds and main-table results report cross-seed averages. The scheduler diagnostics in Section F report means with standard deviations

  61. [61]

    Experiments Compute Resources Question: For each experiment, does the paper provide sufficient information on the computer resources needed to reproduce the experiments? Answer: [Yes] Justification: Section C specifies GPU types (NVIDIA A100-80GB), model serving details (vLLM), and Table 3 reports token consumption and wall-clock time for all methods

  62. [62]

    Code Of Ethics Question: Does the research conducted in the paper conform, in every respect, with the NeurIPS Code of Ethics? Answer: [Yes] Justification: This work focuses on evaluation scheduling for prompt optimization and does not involve human subjects, deception, or harmful applications

  63. [63]

    Broader Impacts Question: Does the paper discuss both potential positive societal impacts and negative societal impacts of the work performed? Answer: [Yes] Justification: Section I discusses both positive impacts (reduced computational cost/carbon footprint) and potential risks (lowering barriers to adversarial prompt engineering) with appropriate mitigations

  64. [64]

    Safeguards Question: Does the paper describe safeguards that have been put in place for responsible release of data or models with a high risk for misuse? Answer: [NA] Justification: This work releases a scheduling algorithm, not a trained model or dataset with misuse risk

  65. [65]

    All are publicly available under permissive licenses

    Licenses for existing assets Question: Are the creators or original owners of assets used in the paper properly credited and are the license and terms of use explicitly mentioned and properly respected? Answer: [Yes] Justification: All benchmarks (BBH, BigBench, MMLU, GSM8K, MATH, MultiArith) and models (Llama-3.1-8B) are properly cited. All are publicly available under permissive licenses

  66. [66]

    New Assets Question: Are new assets introduced in the paper well documented and is the documentation provided alongside the assets? Answer: [Yes] Justification: Our code release will include documentation, configuration files, and instructions for reproducing all experiments

  67. [67]

    Crowdsourcing and Research with Human Subjects Question: For crowdsourcing experiments and research with human subjects, does the paper include the full text of instructions given to participants? Answer: [NA] Justification: This work does not involve crowdsourcing or human subjects

  68. [68]

    Institutional Review Board (IRB) Approvals or Equivalent for Research with Human Subjects Question: Does the paper describe potential risks incurred by study participants? Answer: [NA] Justification: This work does not involve human subjects research

  69. [69]

    Declaration of LLM usage Question: Does the paper describe the usage of LLMs in the core methodology? Answer: [Yes] Justification: Section 3 and Section 4 fully describe the use of LLMs (Llama-3.1-8B as worker, GPT-OSS-120B as meta-optimizer) including model configurations and API details.