LLM Prompt Duel Optimizer: Efficient Label-Free Prompt Optimization
Pith reviewed 2026-05-18 06:58 UTC · model grok-4.3
The pith
The Prompt Duel Optimizer finds better prompts by using LLM pairwise judgments in a dueling-bandit setup without any ground-truth labels.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
PDO casts prompt optimization as a dueling-bandit problem. It uses Double Thompson Sampling to prioritize informative pairwise comparisons from an LLM judge, and top-performer guided mutation to expand and prune the candidate pool, which together enable label-free optimization.
What carries the argument
Dueling-bandit formulation of prompt selection, where Double Thompson Sampling selects informative pairwise comparisons and top-performer guided mutation manages the candidate pool.
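To make the selection step concrete, here is a minimal sketch of how a Double Thompson Sampling round might choose the next pair of prompts to compare. It follows the general shape of the algorithm in the dueling-bandit literature and is not taken from the paper: the function names `dts_select` and `record_duel`, the `alpha` value, and the tie-breaking are all assumptions, and the paper's variant may differ.

```python
import math
import random


def dts_select(wins, t, alpha=0.5, rng=random):
    """Choose the next pair of prompts to duel (simplified D-TS sketch).

    wins[i][j] = number of duels prompt i has won against prompt j.
    t          = current round, starting at 1.
    alpha      = confidence-width parameter (assumed value, not from the paper).
    """
    k = len(wins)

    def bounds(i, j):
        """Upper and lower confidence bounds on P(i beats j)."""
        if i == j:
            return 0.5, 0.5
        n = wins[i][j] + wins[j][i]
        if n == 0:
            return 1.0, 0.0  # no data yet: widest possible interval
        mean = wins[i][j] / n
        width = math.sqrt(alpha * math.log(t) / n)
        return mean + width, mean - width

    # Step 1: keep only arms whose optimistic Copeland score is maximal.
    optimistic = [
        sum(1 for j in range(k) if j != i and bounds(i, j)[0] > 0.5)
        for i in range(k)
    ]
    candidates = [i for i in range(k) if optimistic[i] == max(optimistic)]

    # Step 2: first Thompson sample; pick the candidate that wins the most
    # sampled duels, breaking ties at random.
    theta = [[0.5] * k for _ in range(k)]
    for i in range(k):
        for j in range(i + 1, k):
            theta[i][j] = rng.betavariate(wins[i][j] + 1, wins[j][i] + 1)
            theta[j][i] = 1.0 - theta[i][j]
    sampled = {
        i: sum(1 for j in range(k) if j != i and theta[i][j] > 0.5)
        for i in candidates
    }
    top = max(sampled.values())
    first = rng.choice([i for i in candidates if sampled[i] == top])

    # Step 3: second Thompson sample against `first`, skipping opponents whose
    # lower bound already settles the comparison. `first` itself stays
    # eligible with the fixed value 0.5.
    second, best = first, 0.5
    for i in range(k):
        if i == first or bounds(i, first)[1] > 0.5:
            continue
        draw = rng.betavariate(wins[i][first] + 1, wins[first][i] + 1)
        if draw > best:
            second, best = i, draw
    return first, second


def record_duel(wins, winner, loser):
    """Update the win matrix after the judge returns a preference."""
    wins[winner][loser] += 1
```

In this sketch the second arm can equal the first, which signals that no opponent looks informative; a caller with a fixed judge budget could skip the judge call in that case rather than spend a comparison.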
If this is right
- Under a fixed number of LLM judge comparisons, PDO returns higher-quality prompts than random selection or simpler ranking baselines.
- The method reduces reliance on expensive labeled validation sets for prompt engineering.
- Quality-cost trade-offs remain favorable when the total judge budget is severely limited.
Where Pith is reading between the lines
- The same comparison-based selection could be applied to optimize other LLM behaviors such as reasoning chains or tool-use formats.
- Combining PDO with occasional human judgments might reduce judge bias while keeping the overall budget low.
- The mutation step may surface interpretable patterns about what prompt changes most affect downstream performance.
Load-bearing premise
Pairwise preferences from an LLM judge are reliable enough to predict which prompt will actually perform better on the downstream task.
What would settle it
Evaluate the final prompts chosen by PDO and by baselines directly on the target tasks using ground-truth metrics; the performance gap should disappear or reverse if judge preferences are uncorrelated with real accuracy.
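One way to run this check is a rank correlation between each prompt's judge-derived score and its ground-truth metric. The sketch below computes Spearman's coefficient with the standard library only; the helper names `average_ranks` and `spearman` are illustrative, and the choice of Spearman correlation is an assumption about how the check would be done, not a result from the paper.

```python
import math


def average_ranks(values):
    """Rank values from 1..n, giving tied values their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    pos = 0
    while pos < len(order):
        end = pos
        while end + 1 < len(order) and values[order[end + 1]] == values[order[pos]]:
            end += 1
        shared = (pos + end) / 2 + 1  # average of the 1-based positions pos..end
        for idx in order[pos:end + 1]:
            ranks[idx] = shared
        pos = end + 1
    return ranks


def spearman(xs, ys):
    """Spearman rank correlation: Pearson correlation of the rank vectors."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two equal-length sequences with at least 2 items")
    rx, ry = average_ranks(xs), average_ranks(ys)
    n = len(rx)
    mx, my = sum(rx) / n, sum(ry) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    vx = sum((a - mx) ** 2 for a in rx)
    vy = sum((b - my) ** 2 for b in ry)
    if vx == 0 or vy == 0:
        return float("nan")  # one side is constant, correlation is undefined
    return cov / math.sqrt(vx * vy)
```

Calling `spearman(judge_win_rates, task_accuracies)` over the candidate pool, where both lists are hypothetical per-prompt measurements, gives a value near 1 when the judge orders prompts the same way the ground truth does and near 0 when the two are unrelated.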
Original abstract
Large language models (LLMs) are highly sensitive to prompts, but most automatic prompt optimization (APO) methods assume access to ground-truth references (e.g., labeled validation data) that are costly to obtain. We propose the Prompt Duel Optimizer (PDO), a sample-efficient framework for label-free prompt optimization based on pairwise preference feedback from an LLM judge. PDO casts prompt selection as a dueling-bandit problem and combines (i) Double Thompson Sampling to prioritize informative comparisons under a fixed judge budget, with (ii) top-performer guided mutation to expand the candidate pool while pruning weak prompts. Experiments on BIG-bench Hard (BBH) and MS MARCO show that PDO consistently identifies stronger prompts than label-free baselines, while offering favorable quality-cost trade-offs under constrained comparison budgets.
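The expand-and-prune step in (ii) can be sketched as follows. This is an illustration under stated assumptions, not the authors' procedure: ranking prompts by Copeland score, the function names, and the `mutate` callable (which in practice would wrap an LLM call) are all hypothetical choices made for the example.

```python
def copeland_scores(wins):
    """Copeland score of each prompt: how many opponents it beats head-to-head."""
    k = len(wins)
    return [
        sum(1 for j in range(k) if j != i and wins[i][j] > wins[j][i])
        for i in range(k)
    ]


def expand_and_prune(prompts, wins, mutate, n_replace):
    """Drop the n_replace weakest prompts and add mutations of the leader.

    prompts   : list of prompt strings currently in the pool.
    wins      : wins[i][j] = duels prompt i has won against prompt j.
    mutate    : hypothetical callable (leader_prompt, variant_index) -> new prompt.
    n_replace : number of prompts to prune and regenerate (assumed parameter).
    """
    if not 0 < n_replace < len(prompts):
        raise ValueError("n_replace must be between 1 and len(prompts) - 1")

    scores = copeland_scores(wins)
    order = sorted(range(len(prompts)), key=lambda i: scores[i], reverse=True)
    keep = order[:len(prompts) - n_replace]
    leader = prompts[order[0]]

    new_prompts = [prompts[i] for i in keep]
    new_prompts += [mutate(leader, n) for n in range(n_replace)]

    # Survivors keep their duel history; newcomers start with no evidence.
    k = len(new_prompts)
    new_wins = [[0] * k for _ in range(k)]
    for a, i in enumerate(keep):
        for b, j in enumerate(keep):
            new_wins[a][b] = wins[i][j]
    return new_prompts, new_wins
```

Keeping the survivors' win counts means the sampler does not have to relearn known comparisons after each pool update, while the zeroed rows for new candidates leave them maximally uncertain and therefore likely to be explored.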
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The manuscript introduces the Prompt Duel Optimizer (PDO), a label-free prompt optimization framework that casts prompt selection as a dueling-bandit problem. It combines Double Thompson Sampling to prioritize informative pairwise comparisons under a fixed LLM-judge budget with top-performer guided mutation to expand and prune the candidate pool. Experiments on BIG-bench Hard (BBH) and MS MARCO report that PDO identifies stronger prompts than label-free baselines while achieving favorable quality-cost trade-offs under constrained comparison budgets.
Significance. If the experimental support is robust, the work could meaningfully advance automatic prompt optimization by removing dependence on ground-truth labels. The integration of established dueling-bandit algorithms with mutation operators provides a concrete, budget-aware alternative to existing label-free methods and may influence practical LLM deployment pipelines where labeled validation data are scarce.
major comments (2)
- [Experiments] Experiments section: the central claim that PDO 'consistently identifies stronger prompts' and offers favorable quality-cost trade-offs rests on LLM-judge pairwise preferences serving as a reliable proxy for downstream task performance, yet no correlation statistics, calibration plots, or ablations against ground-truth accuracy/relevance on BBH or MS MARCO are supplied; without these the label-free setting remains unverified.
- [Method and Experiments] Method and Experiments: the manuscript applies Double Thompson Sampling and mutation operators but provides no details on the number of independent runs, statistical significance tests, or error bars for the reported outperformance; these omissions make it impossible to assess whether the observed gains exceed baseline variance under the stated comparison budgets.
minor comments (2)
- [Abstract] Abstract: the phrase 'favorable quality-cost trade-offs' is used without defining the precise quality and cost metrics (e.g., exact judge score vs. token count) that are later compared.
- [Method] Notation: the description of 'top-performer guided mutation' would benefit from an explicit pseudocode listing or equation showing how the mutation operator is conditioned on the current top-ranked prompts.
Simulated Author's Rebuttal
We thank the referee for the constructive feedback and the recommendation for major revision. We address each major comment below and commit to revisions that strengthen the experimental validation and reporting without altering the core contributions of the label-free optimization framework.
Point-by-point responses
- Referee: [Experiments] Experiments section: the central claim that PDO 'consistently identifies stronger prompts' and offers favorable quality-cost trade-offs rests on LLM-judge pairwise preferences serving as a reliable proxy for downstream task performance, yet no correlation statistics, calibration plots, or ablations against ground-truth accuracy/relevance on BBH or MS MARCO are supplied; without these the label-free setting remains unverified.
  Authors: We appreciate the referee's emphasis on verifying the LLM-judge proxy. While PDO performs optimization using only pairwise preferences, the final selected prompts are evaluated against ground-truth labels on BBH (accuracy) and MS MARCO (relevance) to quantify actual gains over baselines. To directly address this point, the revised manuscript will add Spearman rank correlations between aggregated judge preference scores and ground-truth metrics, calibration plots of judge reliability, and ablations showing alignment between judge-guided rankings and true performance under the fixed comparison budgets. These additions will be placed in the Experiments section to confirm the proxy's effectiveness.
  Revision: yes
- Referee: [Method and Experiments] Method and Experiments: the manuscript applies Double Thompson Sampling and mutation operators but provides no details on the number of independent runs, statistical significance tests, or error bars for the reported outperformance; these omissions make it impossible to assess whether the observed gains exceed baseline variance under the stated comparison budgets.
  Authors: We agree that these experimental details are essential for assessing robustness. The revised version will explicitly state that all results are averaged over 5 independent runs with different random seeds, include standard deviation error bars in figures and tables, and report statistical significance via paired t-tests (or Wilcoxon signed-rank tests where appropriate) with p-values to show that gains exceed baseline variance under the constrained budgets.
  Revision: yes
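For readers who want to see what the promised reporting involves, the following sketch computes the mean difference, its standard deviation, and the paired t statistic for matched runs of two methods, using only the standard library. The function name `paired_summary` is illustrative and any scores passed to it are the caller's own; nothing here reproduces numbers from the paper.

```python
import math
import statistics


def paired_summary(method_scores, baseline_scores):
    """Summarize matched runs (same seeds) of a method and a baseline.

    Returns the mean and sample standard deviation of the per-run differences
    and the paired t statistic with n - 1 degrees of freedom.
    """
    if len(method_scores) != len(baseline_scores) or len(method_scores) < 2:
        raise ValueError("need at least 2 matched runs")
    diffs = [m - b for m, b in zip(method_scores, baseline_scores)]
    n = len(diffs)
    mean_diff = statistics.mean(diffs)
    sd_diff = statistics.stdev(diffs)  # sample standard deviation (n - 1)
    if sd_diff == 0:
        # Identical differences in every run: the statistic is unbounded.
        t_stat = math.inf if mean_diff != 0 else 0.0
    else:
        t_stat = mean_diff / (sd_diff / math.sqrt(n))
    return {"n": n, "mean_diff": mean_diff, "sd_diff": sd_diff, "t_stat": t_stat}
```

With 5 runs there are 4 degrees of freedom, so a two-sided test at the 0.05 level requires the absolute t statistic to exceed roughly 2.776. That threshold is high, which is why a sample this small can only confirm fairly large and consistent gains.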
Circularity Check
No circularity: framework applies established dueling-bandit methods without self-referential reduction.
Full rationale
The paper casts prompt selection as a dueling-bandit problem and invokes Double Thompson Sampling plus mutation operators drawn from prior literature. No equations or steps reduce the claimed performance gains to a quantity defined by a fitted parameter from the same data, nor does any load-bearing premise rest solely on self-citation. The derivation remains self-contained against external benchmarks (BBH, MS MARCO) and does not exhibit any of the enumerated circular patterns.
Axiom & Free-Parameter Ledger
free parameters (2)
- comparison budget
- mutation and sampling hyperparameters
axioms (1)
- domain assumption: LLM judge pairwise preferences correlate with actual task performance
Lean theorems connected to this paper
- IndisputableMonolith/Cost/FunctionalEquation.lean · washburn_uniqueness_aczel · tag: unclear
  unclear: Relation between the paper passage and the cited Recognition theorem.
  Paper passage: "PDO casts prompt selection as a dueling-bandit problem and combines (i) Double Thompson Sampling to prioritize informative comparisons under a fixed judge budget, with (ii) top-performer guided mutation"
- IndisputableMonolith/Foundation/RealityFromDistinction.lean · reality_from_one_distinction · tag: unclear
  unclear: Relation between the paper passage and the cited Recognition theorem.
  Paper passage: "We formulate preference-based prompt optimization without ground-truth label references as a dueling bandit problem"
What do these tags mean?
- matches: The paper's claim is directly supported by a theorem in the formal canon.
- supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
- extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
- uses: The paper appears to rely on the theorem as machinery.
- contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
- unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.
Forward citations
Cited by 1 Pith paper
- Embedding by Elicitation: Dynamic Representations for Bayesian Optimization of System Prompts
  ReElicit uses LLMs to elicit adaptive feature embeddings for Gaussian process Bayesian optimization of system prompts under aggregate-only feedback, outperforming baselines across ten tasks with a 30-evaluation budget.