pith. machine review for the scientific record.

arxiv: 2605.14553 · v1 · submitted 2026-05-14 · 💻 cs.LG · cs.AI

Recognition: no theorem link

Efficient Multi-objective Prompt Optimization via Pure-exploration Bandits

Authors on Pith: no claims yet

Pith reviewed 2026-05-15 01:47 UTC · model grok-4.3

classification 💻 cs.LG cs.AI
keywords multi-objective prompt optimization · pure-exploration bandits · large language models · Pareto set recovery · best arm identification · structured bandits

The pith

Multi-objective prompt selection for large language models reduces to pure-exploration bandit problems, enabling efficient algorithms with theoretical guarantees.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper shows that selecting prompts to optimize several performance measures at once can be handled by modeling each prompt as an arm in a pure-exploration bandit game. In this setting, the algorithm pulls arms to learn their reward vectors across objectives without committing to a single best arm early. It adapts existing multi-objective bandit methods and adds a new procedure for identifying the best feasible arm in structured bandits, with error bounds in the linear case. Experiments on several LLMs demonstrate that these bandit methods recover better prompt sets than standard baselines. This matters because prompt engineering often involves trade-offs between accuracy, efficiency, and other factors that no single score can capture.
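To make the reduction concrete, here is a minimal sketch of the uniform-exploration baseline the paper compares against, not the paper's own algorithm: each candidate prompt is an arm, a pull evaluates the prompt on one sampled input and returns a reward vector (say, ROUGE and brevity), and the output is the empirical Pareto set of mean rewards. The `evaluate_prompt` oracle and the fixed per-arm budget are illustrative assumptions.

```python
import numpy as np

def empirical_pareto_set(means):
    """Return indices of arms whose estimated reward vectors are not
    dominated by any other arm (higher is better in every objective)."""
    pareto = []
    for i, mu_i in enumerate(means):
        dominated = any(
            np.all(mu_j >= mu_i) and np.any(mu_j > mu_i)
            for j, mu_j in enumerate(means) if j != i
        )
        if not dominated:
            pareto.append(i)
    return pareto

def uniform_pareto_recovery(prompts, evaluate_prompt, budget_per_arm):
    """Pull every arm (prompt) the same number of times and return the
    empirical Pareto set of mean reward vectors.

    evaluate_prompt is a hypothetical oracle: it runs the prompt on a
    sampled input and returns a reward vector, e.g. (ROUGE, brevity).
    """
    means = []
    for p in prompts:
        pulls = np.array([evaluate_prompt(p) for _ in range(budget_per_arm)])
        means.append(pulls.mean(axis=0))
    return empirical_pareto_set(means)
```

The adapted bandit algorithms in the paper allocate pulls adaptively rather than uniformly; this sketch only fixes the interface the reduction requires.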

Core claim

Casting multi-objective prompt selection as a pure-exploration bandit problem lets provably efficient algorithms from multi-objective bandits be adapted to it, alongside a novel design for best feasible arm identification in structured bandits that provides theoretical guarantees on identification error in the linear case.

What carries the argument

The pure-exploration bandits framework, specifically adapted multi-objective bandit algorithms and a novel best feasible arm identification method for structured bandits.
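As a hedged sketch of the second ingredient, the snippet below identifies a best feasible arm by plain sequential halving over empirical reward vectors. The paper's Algorithm 2 (Linear Constrained Sequential Halving, per its appendix) additionally uses a G-optimal design allocator and a round-wise linear estimator over arm features, which this unstructured simplification omits; `pull` and `tau` are stand-ins for the evaluation oracle and the constraint threshold.

```python
import numpy as np

def constrained_sequential_halving(arms, pull, budget, tau):
    """Best-feasible-arm identification by sequential halving (sketch).

    Each pull(a) returns a reward vector (objective, constraint).
    Arms whose estimated constraint reward clears tau are preferred;
    survivors are ranked by estimated objective reward and the active
    set is halved each round until one arm remains.
    """
    active = list(arms)
    rounds = max(1, int(np.ceil(np.log2(len(arms)))))
    for _ in range(rounds):
        pulls_per_arm = max(1, budget // (len(active) * rounds))
        estimates = {
            a: np.mean([pull(a) for _ in range(pulls_per_arm)], axis=0)
            for a in active
        }
        # Estimated-feasible arms sort first; ties break on the objective.
        active.sort(
            key=lambda a: (estimates[a][1] >= tau, estimates[a][0]),
            reverse=True,
        )
        active = active[: max(1, len(active) // 2)]
    return active[0]
```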

If this is right

  • Bandit-based methods recover the Pareto set of prompts more efficiently than baselines.
  • Identification of the best feasible prompt achieves lower error rates under linear reward structures.
  • Performance improves across multiple LLMs in empirical tests for both Pareto recovery and feasible prompt identification.
  • The approach provides a principled framework that avoids exhaustive enumeration of prompt candidates.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the authors make directly.

  • Similar bandit reductions could apply to other multi-criteria optimization tasks in machine learning beyond prompts.
  • Handling non-stationarity in LLM responses might require extensions to the bandit model for real-world deployment.
  • Scalability to higher-dimensional objective spaces remains an open question that the linear case bounds could help address.

Load-bearing premise

Prompt performance across multiple objectives can be modeled as rewards from independent or linearly structured arms in a pure-exploration bandit setting without significant interference or non-stationarity from LLM stochasticity.

What would settle it

An experiment where the same prompt produces highly variable and time-varying performance across runs on the same LLM, causing the bandit algorithm to fail to identify the true Pareto front or best feasible prompt within the predicted sample budget.
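A minimal way to mock up that stress test, with an invented drift model rather than anything from the paper: wrap each arm's evaluation oracle so that its mean reward drifts with the pull count, then check whether a procedure like the halving sketch above still recovers the best feasible arm within the budget the linear-case bound deems sufficient.

```python
import numpy as np

rng = np.random.default_rng(0)

def make_drifting_pull(base_mu, drift_amp=0.3):
    """Hypothetical non-stationary oracle: the arm's mean reward vector
    oscillates with its own pull count, violating the premise that each
    arm has a fixed mean. drift_amp and the sinusoid are invented."""
    state = {"t": 0}
    def pull():
        state["t"] += 1
        drift = drift_amp * np.sin(0.05 * state["t"])
        return base_mu + drift + rng.normal(0.0, 0.1, size=base_mu.shape)
    return pull

# Wiring into the halving sketch above: one oracle per arm, dispatched
# by arm index, e.g.
#   oracles = {a: make_drifting_pull(mu[a]) for a in arms}
#   constrained_sequential_halving(arms, lambda a: oracles[a](), budget, tau)
# If identification accuracy under such drift falls to chance within the
# theoretically sufficient budget, the load-bearing premise fails.
```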

Figures

Figures reproduced from arXiv: 2605.14553 by Chengshuai Shi, Cong Shen, Donghao Li, Jing Yang, Weijuan Ou.

Figure 1. Trade-offs between ROUGE and Brevity (motivating example).
Figure 2. Average soft constrained reward vs. per-arm budget. Error bars denote 95% confidence intervals.
Figure 3. Hypervolume recovery vs. per-arm budget. Error bars denote 95% confidence intervals.
Figure 4. The system prompt template used for both Gemma and Llama3.
Figure 5. Prompt generation with 3-5 examples.
Figure 6. Distributions of the objectives for generated prompts on XSum.
Figure 7. Average recovered hypervolume (%) and constrained reward on XSum using Llama-3.
Figure 8. Average recovered hypervolume (%) and average constrained reward (%) with 1000 candidate prompts.
Figure 9. Feasible average reward vs. per-arm budget on XSum using Llama3 with varying constraints.
read the original abstract

Prompt engineering has become central to eliciting the capabilities of large language models (LLMs). At its core lies prompt selection -- efficiently identifying the most effective prompts. However, most prior investigations overlook a key challenge: the inherently multi-faceted nature of prompt performance, which cannot be captured by a single metric. To fill this gap, we study the multi-objective prompt selection problem under two practical settings: Pareto prompt set recovery and best feasible prompt identification. Casting the problem into the pure-exploration bandits framework, we adapt provably efficient algorithms from multi-objective bandits and further introduce a novel design for best feasible arm identification in structured bandits, with theoretical guarantees on the identification error in the linear case. Extensive experiments across multiple LLMs show that the bandit-based approaches yield significant improvements over baselines, establishing a principled and efficient framework for multi-objective prompt optimization.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 1 minor

Summary. The paper frames multi-objective prompt selection for LLMs as a pure-exploration bandit problem. It adapts existing multi-objective bandit algorithms for Pareto prompt set recovery and introduces a novel algorithm for best feasible arm identification in structured bandits, supplying theoretical identification-error bounds under linear reward assumptions. Experiments on multiple LLMs report that the bandit methods yield significant improvements over baselines.

Significance. If the linearity assumption holds and the bounds are correctly derived, the work supplies a principled, efficient framework with theoretical backing for multi-objective prompt optimization, moving beyond heuristic methods. The adaptation of established algorithms plus the novel structured-bandit design, paired with empirical gains, would represent a useful contribution to LLM prompt engineering.

major comments (2)
  1. [Abstract / theoretical section] Abstract and theoretical development: the identification-error bounds for the novel best-feasible-arm algorithm are stated to hold only in the linear case, yet the manuscript provides no verification that multi-objective LLM rewards (accuracy, coherence, safety, etc.) are realizable as linear functions of any fixed prompt feature map. This assumption is load-bearing for the central theoretical claim.
  2. [Experiments] Experimental section: reported improvements over baselines are presented without error-bar details, data-exclusion rules, or any diagnostic checking whether observed gains remain consistent when the linear reward structure is relaxed or violated by LLM stochasticity and non-linear prompt effects.
minor comments (1)
  1. [Introduction / Modeling] The abstract refers to 'the linear case' without an early, explicit definition of the feature map or the precise linearity assumption; this should be stated clearly in the modeling section before the algorithm is introduced.
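For concreteness, the condition the referee wants stated early is the standard linear realizability assumption; reconstructed from the notation quoted in the paper's appendix (arm set X with |X| = K, a feature φ(x) ∈ R^d per arm, two reward dimensions), it would read:

```latex
% Linear structured-bandit assumption, reconstructed from the paper's
% Appendix A notation; the paper's exact statement may differ in detail.
% Each arm x has a known feature \phi(x) \in \mathbb{R}^d, and for each
% reward dimension j (objective and constraint) there is an unknown
% \theta_j \in \mathbb{R}^d such that
\mu_j(x) = \langle \theta_j, \phi(x) \rangle
\qquad \text{for all } x \in \mathcal{X},\ j \in \{1, 2\}.
```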

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive comments on our manuscript. We address each major comment point by point below, indicating planned revisions to strengthen the presentation of assumptions and experimental details.

read point-by-point responses
  1. Referee: [Abstract / theoretical section] Abstract and theoretical development: the identification-error bounds for the novel best-feasible-arm algorithm are stated to hold only in the linear case, yet the manuscript provides no verification that multi-objective LLM rewards (accuracy, coherence, safety, etc.) are realizable as linear functions of any fixed prompt feature map. This assumption is load-bearing for the central theoretical claim.

    Authors: We thank the referee for highlighting this point. The manuscript explicitly states that the identification-error bounds hold under the linear reward assumption (see abstract and Theorem 4). This is a standard modeling choice in structured bandits that enables finite-sample guarantees, similar to linear contextual bandits. We do not claim that LLM rewards are exactly linear in a fixed feature map, nor do we provide a formal verification of this realizability, as such verification lies outside the paper's scope. In the revision we will add a paragraph in the discussion section clarifying the assumption, its practical implications, and possible non-linear extensions. revision: partial

  2. Referee: [Experiments] Experimental section: reported improvements over baselines are presented without error-bar details, data-exclusion rules, or any diagnostic checking whether observed gains remain consistent when the linear reward structure is relaxed or violated by LLM stochasticity and non-linear prompt effects.

    Authors: We agree that the experimental section requires additional rigor. In the revised manuscript we will report error bars (standard deviations over multiple independent runs) for all metrics, explicitly state data-exclusion rules (none beyond standard invalid-prompt filtering), and add diagnostic experiments assessing robustness when the linear structure is relaxed, including simulations with added non-linearity and higher stochasticity. revision: yes

Circularity Check

0 steps flagged

Adaptation of established multi-objective bandit algorithms with novel linear-case extension; no reduction to self-fitted parameters or self-citation chains

full rationale

The derivation casts prompt selection as a pure-exploration bandit problem and directly adapts existing provably efficient algorithms from the multi-objective bandits literature, adding a novel design only for the linear structured case. The identification-error bounds are stated to hold under the linear reward assumption and are not derived from parameters fitted inside this paper. No step equates a claimed prediction to an input by construction, and the central guarantees rest on external bandit theory rather than load-bearing self-citation or ansatz smuggling. Empirical results demonstrate improvement but do not alter the non-circular status of the theoretical chain.

Axiom & Free-Parameter Ledger

0 free parameters · 2 axioms · 0 invented entities

The framework rests on standard multi-objective bandit assumptions plus a linear structure assumption invoked for the new identification guarantees; no free parameters or invented entities are described in the abstract.

axioms (2)
  • domain assumption: Prompt performances across objectives behave as rewards from arms that can be explored in a pure-exploration multi-objective bandit setting
    This modeling choice is required to cast prompt selection as the bandit problem solved by the adapted algorithms.
  • domain assumption: Linear structure holds for the best feasible arm identification guarantees
    Explicitly stated as the setting in which theoretical error bounds are provided.

pith-pipeline@v0.9.0 · 5445 in / 1342 out tokens · 49757 ms · 2026-05-15T01:47:29.025982+00:00 · methodology

discussion (0)


Reference graph

Works this paper leans on

33 extracted references · 33 canonical work pages · 4 internal anchors

  1. [1]

    GEPA: Reflective prompt evolution can outperform reinforcement learning

    Lakshya A Agrawal, Shangyin Tan, Dilara Soylu, Noah Ziems, Rishi Khare, Krista Opsahl-Ong, Arnav Singhvi, Herumb Shandilya, Michael J Ryan, Meng Jiang, et al. GEPA: Reflective prompt evolution can outperform reinforcement learning. In NeurIPS 2025 Workshop on Foundations of Reasoning in Language Models.

  2. [2]

    Best arm identification in multi-armed bandits

    Jean-Yves Audibert and Sébastien Bubeck. Best arm identification in multi-armed bandits. In COLT, 23rd Conference on Learning Theory, 2010.

  3. [3]

    Language models are few-shot learners

    Tom Brown, Benjamin Mann, Nick Ryder, Melanie Subbiah, Jared D Kaplan, Prafulla Dhariwal, Arvind Neelakantan, Pranav Shyam, Girish Sastry, Amanda Askell, et al. Language models are few-shot learners. Advances in Neural Information Processing Systems, 33:1877–1901.

  4. [5]

    InstructZero: Efficient instruction optimization for black-box large language models

    Lichang Chen, Jiuhai Chen, Tom Goldstein, Heng Huang, and Tianyi Zhou. InstructZero: Efficient instruction optimization for black-box large language models. In Proceedings of the 41st International Conference on Machine Learning, pp. 6503–6518.

  5. [6]

    Discrete prompt optimization via constrained generation for zero-shot re-ranker

    Sukmin Cho, Soyeong Jeong, Jeong yeon Seo, and Jong C Park. Discrete prompt optimization via constrained generation for zero-shot re-ranker. In Findings of the Association for Computational Linguistics: ACL 2023, pp. 960–971.

  6. [7]

    RLPrompt: Optimizing discrete text prompts with reinforcement learning

    Mingkai Deng, Jianyu Wang, Cheng-Ping Hsieh, Yihan Wang, Han Guo, Tianmin Shu, Meng Song, Eric Xing, and Zhiting Hu. RLPrompt: Optimizing discrete text prompts with reinforcement learning. In Proceedings of the 2022 Conference on Empirical Methods in Natural Language Processing, pp. 3369–3391.

  7. [8]

    Constrained pure exploration multi-armed bandits with a fixed budget

    Fathima Zarin Faizal and Jayakrishnan Nair. Constrained pure exploration multi-armed bandits with a fixed budget. arXiv preprint arXiv:2211.14768.

  8. [9]

    Connecting large language models with evolutionary algorithms yields powerful prompt optimizers

    Qingyan Guo, Rui Wang, Junliang Guo, Bei Li, Kaitao Song, Xu Tan, Guoqing Liu, Jiang Bian, and Yujiu Yang. Connecting large language models with evolutionary algorithms yields powerful prompt optimizers. In International Conference on Learning Representations, volume 2024, pp. 34133–34156.

  9. [10]

    MORL-Prompt: An empirical analysis of multi-objective reinforcement learning for discrete prompt optimization

    Yasaman Jafari, Dheeraj Mekala, Rose Yu, and Taylor Berg-Kirkpatrick. MORL-Prompt: An empirical analysis of multi-objective reinforcement learning for discrete prompt optimization. In Findings of the Association for Computational Linguistics: EMNLP 2024, pp. 9878–9889.

  10. [11]

    Prompt-RAG: Pioneering vector embedding-free retrieval-augmented generation in niche domains, exemplified by Korean medicine

    Bongsu Kang, Jundong Kim, Tae-Rim Yun, and Chang-Eop Kim. Prompt-RAG: Pioneering vector embedding-free retrieval-augmented generation in niche domains, exemplified by Korean medicine. arXiv preprint arXiv:2401.11246.

  11. [12]

    Bounded archiving using the Lebesgue measure

    Joshua D Knowles, David W Corne, and Mark Fleischer. Bounded archiving using the Lebesgue measure. In The 2003 Congress on Evolutionary Computation.

  12. [13]

    Bandit Pareto set identification in a multi-output linear model

    Cyrille Kone, Emilie Kaufmann, and Laura Richert. Bandit Pareto set identification in a multi-output linear model. In AISTATS 2025, 8th International Conference on Artificial Intelligence and Statistics.

  13. [14]

    Meta-prompt optimization for LLM-based sequential decision making

    Mingze Kong, Zhiyong Wang, Yao Shu, and Zhongxiang Dai. Meta-prompt optimization for LLM-based sequential decision making. In ICLR 2025 Workshop on Reasoning and Planning for Large Language Models.

  14. [15]

    Prompt optimization with human feedback

    Xiaoqiang Lin, Zhongxiang Dai, Arun Verma, See-Kiong Ng, Patrick Jaillet, and Bryan Kian Hsiang Low. Prompt optimization with human feedback. In ICML 2024 Workshop on Models of Human Feedback for AI Alignment, 2024a. Xiaoqiang Lin, Zhaoxuan Wu, Zhongxiang Dai, Wenyang Hu, Yao Shu, See-Kiong Ng, Patrick Jaillet, and Bryan Kian Hsiang Low. Use your INSTINC...

  15. [16]

    Don't Give Me the Details, Just the Summary! Topic-Aware Convolutional Neural Networks for Extreme Summarization

    Shashi Narayan, Shay B. Cohen, and Mirella Lapata. Don't give me the details, just the summary! Topic-aware convolutional neural networks for extreme summarization. arXiv, abs/1808.08745.

  16. [17]

    Automatic prompt optimization with "gradient descent" and beam search

    Reid Pryzant, Dan Iter, Jerry Li, Yin Lee, Chenguang Zhu, and Michael Zeng. Automatic prompt optimization with "gradient descent" and beam search. In Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing, pp. 7957–7968.

  17. [18]

    A Systematic Survey of Prompt Engineering in Large Language Models: Techniques and Applications

    Pranab Sahoo, Ayush Kumar Singh, Sriparna Saha, Vinija Jain, Samrat Mondal, and Aman Chadha. A systematic survey of prompt engineering in large language models: Techniques and applications. arXiv preprint arXiv:2402.07927.

  18. [19]

    The Prompt Report: A Systematic Survey of Prompt Engineering Techniques

    Sander Schulhoff, Michael Ilie, Nishant Balepur, Konstantine Kahadze, Amanda Liu, Chenglei Si, Yinheng Li, Aayush Gupta, HyoJung Han, Sevien Schulhoff, et al. The prompt report: A systematic survey of prompt engineering techniques. arXiv preprint arXiv:2406.06608.

  19. [20]

    AutoPrompt: Eliciting knowledge from language models with automatically generated prompts

    Taylor Shin, Yasaman Razeghi, Robert L Logan IV, Eric Wallace, and Sameer Singh. AutoPrompt: Eliciting knowledge from language models with automatically generated prompts. In Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP), pp. 4222–4235.

  20. [21]

    Gemma: Open Models Based on Gemini Research and Technology

    Gemma Team, Thomas Mesnard, Cassidy Hardin, Robert Dadashi, Surya Bhupatiraju, Shreya Pathak, Laurent Sifre, Morgane Rivière, Mihir Sanjay Kale, Juliette Love, et al. Gemma: Open models based on Gemini research and technology. arXiv preprint arXiv:2403.08295.

  21. [22]

    Chain-of-thought prompting elicits reasoning in large language models

    Jason Wei, Xuezhi Wang, Dale Schuurmans, Maarten Bosma, Fei Xia, Ed Chi, Quoc V Le, Denny Zhou, et al. Chain-of-thought prompting elicits reasoning in large language models. Advances in Neural Information Processing Systems, 35:24824–24837.

  22. [23]

    A survey of attacks on large language models

    Wenrui Xu and Keshab K Parhi. A survey of attacks on large language models. arXiv preprint arXiv:2505.12567.

  23. [24]

    Backdooring instruction-tuned large language models with virtual prompt injection

    Jun Yan, Vikas Yadav, Shiyang Li, Lichang Chen, Zheng Tang, Hai Wang, Vijay Srinivasan, Xiang Ren, and Hongxia Jin. Backdooring instruction-tuned large language models with virtual prompt injection. In Proceedings of the 2024 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies (Volume 1: ...

  24. [25]

    InstOptima: Evolutionary multi-objective instruction optimization via large language model-based instruction operators

    Heng Yang and Ke Li. InstOptima: Evolutionary multi-objective instruction optimization via large language model-based instruction operators. In Findings of the Association for Computational Linguistics: EMNLP 2023, pp. 13593–13602.

  25. [26]

    Internal anchor · Appendix A: best feasible prompt identification

    Appendix A formally presents the settings and notations for the Linear Constrained Sequential Halving algorithm (Algorithm 2).

  26. [27]

    Internal anchor · linear constrained bandit setting

    For the multi-objective bandit setting, the prompt (arm) set is X with |X| = K, each arm x ∈ X has an expected performance or reward vector μ(x) and an associated feature φ(x) ∈ R^d, and the linear constrained setting assumes only two reward dimensions (an objective and a constraint).

  27. [28]

    Internal anchor · Algorithm 2: LinearConstrainedSequentialHalving

    Inputs: budget B, threshold τ, prompt set X, feature map φ(·), tolerance ε, and parameter κ ∈ (0, 1/3]. The algorithm is instantiated from GENSEC (Algorithm 1) using the Sequential Halving scheduler, the G-optimal design allocator, and the linear estimator.

  28. [29]

    Internal anchor · round structure and linear estimator

    In round r, given the active arm set A_{r−1}, each pulled arm satisfies x(t) ∈ A_{r−1} and yields an observed outcome f(t); pulls are allocated by rounding design weights {w_i} into per-arm counts {N_i}. For the linear case, the estimator relies only on data collected in the current round to estimate θ.

  29. [30]

    Internal anchor · elimination probability bound (proof step)

    For arm 1 to be eliminated at the end of round r, at least l_r arms must rank above it. The probability of this event is bounded as P(N_r ≥ l_r) ≤ E[N_r]/l_r = (1/l_r) Σ_{x ∈ A_{r−1}, x ≠ 1} P(G_r(x)) ≤ (1/l_r) Σ_{x ∈ A_{r−1}, x ≠ 1} 12 exp(−a n_r Δ(x)² / (4 d_r)) ≤ (12 l_{r−1}/l_r) · exp(−a n_r ...

  30. [31]

    Internal anchor · Theorem 2 (identification error bound)

    Given a fixed set of K arms X with the linear reward structure specified in Equations (5) and (6), under total budget B ≥ 45 d ⌈log₂ K⌉, the probability that Algorithm 2 fails to output the best feasible arm is upper bounded by P(1 ∉ A_R) ≤ 48 ⌈log₂ K⌉ · exp(−(a/4) · (B/⌈log₂ K⌉) · (1/(dH))), where a = 1/(6σ²), H = max_{x ∈ X∖{1}} 1/Δ²(x), and Δ...

  31. [32]

    Internal anchor · experimental setup (system template)

    For the instruction models, including Llama3-8b-instruct (Grattafiori et al., 2024), the recommended system prompt template (Figure 4) is adopted.

  32. [33]

    Internal anchor · ROUGE metric

    The ROUGE score is a widely used metric for the summarization task that captures token-wise similarity between two texts. The experiments use its ROUGE-Lsum variant from the Hugging Face evaluate Python package, which computes the longest common subsequences between the generated text and the reference and aggregates the results across sentences.

  33. [34]

    Internal anchor · Figure 9 and constraint robustness

    Feasible average reward vs. per-arm budget on XSum using Llama3 with varying constraints (0.6, 0.5, 0.4); the proposed algorithms consistently outperform the uniform baseline, and the choice of constraint does not compromise their effectiveness.