arxiv: 2510.13907 · v3 · submitted 2025-10-14 · 💻 cs.CL · stat.ML

LLM Prompt Duel Optimizer: Efficient Label-Free Prompt Optimization

Pith reviewed 2026-05-18 06:58 UTC · model grok-4.3

classification 💻 cs.CL stat.ML
keywords prompt optimization · label-free · dueling bandits · LLM judge · pairwise preference · BIG-bench Hard · MS MARCO

The pith

The Prompt Duel Optimizer finds better prompts by using LLM pairwise judgments in a dueling-bandit setup without any ground-truth labels.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

This paper introduces the Prompt Duel Optimizer (PDO) to improve prompts for large language models when labeled validation data is unavailable or expensive. It models prompt selection as a dueling-bandit problem and draws on pairwise preference signals from an LLM judge to decide which prompts are stronger. Double Thompson Sampling chooses the most informative comparisons under a fixed budget, while top-performer guided mutation generates new candidates and removes weak ones. On BIG-bench Hard and MS MARCO, the approach yields stronger prompts than other label-free methods at comparable or lower judgment cost.

Core claim

PDO casts prompt optimization as a dueling-bandit problem. It uses Double Thompson Sampling to prioritize informative pairwise comparisons from an LLM judge, and pairs this with top-performer guided mutation to expand and prune the candidate pool, enabling effective label-free optimization.

What carries the argument

Dueling-bandit formulation of prompt selection, where Double Thompson Sampling selects informative pairwise comparisons and top-performer guided mutation manages the candidate pool.
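
To make the comparison-selection idea concrete, here is a minimal sketch of Thompson-sampling-based duel selection. It is a simplified illustration, not the paper's implementation: it omits the confidence-bound filtering used in full Double Thompson Sampling, never duels a prompt against itself, and replaces the LLM judge with a hypothetical biased coin flip. The names `select_duel`, `run_duels`, and `true_pref` are assumptions introduced for this example.

```python
import random


def select_duel(wins, rng):
    """Pick a pair (first, second) using two rounds of posterior sampling.

    wins[i][j] counts how often prompt i beat prompt j.
    Simplified: no confidence-bound filtering, and second != first.
    """
    k = len(wins)
    # Round 1: sample a full preference matrix from Beta posteriors and
    # take a prompt with the highest sampled Copeland score.
    theta = [[0.5] * k for _ in range(k)]
    for i in range(k):
        for j in range(i + 1, k):
            theta[i][j] = rng.betavariate(wins[i][j] + 1, wins[j][i] + 1)
            theta[j][i] = 1.0 - theta[i][j]
    copeland = [sum(theta[i][j] > 0.5 for j in range(k) if j != i)
                for i in range(k)]
    best = max(copeland)
    first = rng.choice([i for i in range(k) if copeland[i] == best])
    # Round 2: resample only the duels against `first` and pick the
    # challenger with the highest sampled chance of beating it.
    challenge = {j: rng.betavariate(wins[j][first] + 1, wins[first][j] + 1)
                 for j in range(k) if j != first}
    second = max(challenge, key=challenge.get)
    return first, second


def run_duels(true_pref, budget, seed=0):
    """Spend `budget` judge calls and return the resulting win matrix."""
    rng = random.Random(seed)
    k = len(true_pref)
    wins = [[0] * k for _ in range(k)]
    for _ in range(budget):
        a, b = select_duel(wins, rng)
        # Hypothetical stand-in for the LLM judge: a prefers-a coin flip.
        if rng.random() < true_pref[a][b]:
            wins[a][b] += 1
        else:
            wins[b][a] += 1
    return wins


# Made-up preference probabilities for three prompts (illustration only).
true_pref = [[0.5, 0.9, 0.9],
             [0.1, 0.5, 0.6],
             [0.1, 0.4, 0.5]]
wins = run_duels(true_pref, budget=60, seed=0)
```

Sampling from the posterior, rather than always dueling the current empirical leader, is what lets the budget flow toward pairs whose outcome is still uncertain.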

If this is right

  • Under a fixed number of LLM judge comparisons, PDO returns higher-quality prompts than random selection or simpler ranking baselines.
  • The method reduces reliance on expensive labeled validation sets for prompt engineering.
  • Quality-cost trade-offs remain favorable when the total judge budget is severely limited.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The same comparison-based selection could be applied to optimize other LLM behaviors such as reasoning chains or tool-use formats.
  • Combining PDO with occasional human judgments might reduce judge bias while keeping the overall budget low.
  • The mutation step may surface interpretable patterns about what prompt changes most affect downstream performance.

Load-bearing premise

Pairwise preferences from an LLM judge are reliable enough to predict which prompt will actually perform better on the downstream task.

What would settle it

Evaluate the final prompts chosen by PDO and by baselines directly on the target tasks using ground-truth metrics; the performance gap should disappear or reverse if judge preferences are uncorrelated with real accuracy.
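
One way to run such a check is a rank correlation between judge-derived prompt scores and ground-truth accuracy. The sketch below computes Spearman's correlation in plain Python, with tied values given their average rank. The numbers in the example are made up for illustration and are not results from the paper; `judge_scores` and `accuracy` are hypothetical names.

```python
def average_ranks(values):
    """Rank values from 1..n, giving tied values their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    pos = 0
    while pos < len(order):
        end = pos
        # Extend `end` over the run of values tied with order[pos].
        while (end + 1 < len(order)
               and values[order[end + 1]] == values[order[pos]]):
            end += 1
        avg = (pos + end) / 2 + 1  # 1-based average rank of the tie group
        for idx in order[pos:end + 1]:
            ranks[idx] = avg
        pos = end + 1
    return ranks


def spearman(xs, ys):
    """Spearman correlation: Pearson correlation of the two rank vectors.

    Raises ZeroDivisionError if either input is constant.
    """
    rx, ry = average_ranks(xs), average_ranks(ys)
    n = len(rx)
    mx, my = sum(rx) / n, sum(ry) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    vx = sum((a - mx) ** 2 for a in rx)
    vy = sum((b - my) ** 2 for b in ry)
    return cov / (vx * vy) ** 0.5


# Made-up values: judge win rate and true accuracy for four prompts.
judge_scores = [0.8, 0.6, 0.7, 0.2]
accuracy = [0.91, 0.70, 0.75, 0.40]
rho = spearman(judge_scores, accuracy)
```

A correlation near zero or negative on real data would be the signal that the judge is not a usable proxy for the task metric.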

Figures

Figures reproduced from arXiv: 2510.13907 by Amel Awadelkarim, Fangzhou Xiong, Justin Lee, Poppy Zhang, Saurabh Verma, Shawndra Hill, Xu Chen, Yuanchen Wu, Yubai Yuan.

Figure 1: Workflow of the Prompt Duel Optimizer. Prompt Optimization Objective. The goal of prompt optimization is to identify a prompt that maximizes task performance. In the absence of ground-truth references, we use pairwise preferences as a practical proxy for selecting high-quality prompts. Using empirical estimates µ̂(p_i, p_j), we select the Condorcet winner when it exists, or otherwise the Copeland winner.…
Figure 2: Test performance of the winning prompt on the four MS-MARCO tasks. Each curve shows…
Figure 3: Figure 3a shows that the oracle converges to the best prompt by round 4; with the LLM judge, Tracking-7 steadily improves to rank 2 and Web of Lies approaches rank ≈ 2.5, while Geometric remains around ranks 6–8 across rounds. These trends confirm that judge reliability is closely related to the performance gaps observed in Tables 1 and 2. Reducing Judge Noise. We hypothesize that a key source of judge noise in B…
Figure 5: Comparison of pairwise preference used in…
Figure 6: For each BBH dataset, we construct a prompt pair (…
Figure 7: Effect of the reasoning-discount γ in the D-TS update across six BBH datasets. Each plot shows the ground-truth accuracy rank (lower is better) of the current Copeland leader over rounds. We fix γ_Answer = 0.5 and ablate γ_Reasoning ∈ {0.0, 0.2, 0.5}. Results indicate that introducing a mild discount at γ_Reasoning = 0.2 generally accelerates convergence and produces better final ranks overall compared to the…
Figure 8: Effect of top-performer-guided mutation on Web of Lies (left) and Tracking-7 (right). … duels per round. At rounds 10 and 20, we apply prompt mutation by selecting the top-3 prompts ranked by Copeland scores and generating 10 new prompts. At the same time, we prune the 10 lowest-ranked prompts by Copeland scores. We always select the prompt with the highest Copeland score as the winner, using the average wi…
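
The Figure 1 caption says the Condorcet winner is selected when it exists, and the Copeland winner otherwise. A minimal sketch of that rule over an empirical win-count matrix follows. The function names are assumptions for illustration, and ties in Copeland score are broken here by lowest index, which is an arbitrary choice made for this example rather than the paper's tie-breaking rule.

```python
def copeland_scores(wins):
    """Copeland score of prompt i: the number of opponents j against which
    i's empirical win rate exceeds 0.5. wins[i][j] = times i beat j."""
    k = len(wins)
    scores = []
    for i in range(k):
        score = 0
        for j in range(k):
            total = wins[i][j] + wins[j][i]
            if j != i and total > 0 and wins[i][j] / total > 0.5:
                score += 1
        scores.append(score)
    return scores


def pick_winner(wins):
    """Return (index, is_condorcet).

    A Condorcet winner beats every other prompt head-to-head, so its
    Copeland score equals k - 1. If none exists, the highest Copeland
    score wins (ties broken by lowest index in this sketch).
    """
    scores = copeland_scores(wins)
    best = max(range(len(wins)), key=lambda i: scores[i])
    return best, scores[best] == len(wins) - 1


# Prompt 0 beats both others: a Condorcet winner exists.
clear = [[0, 3, 2],
         [1, 0, 3],
         [1, 1, 0]]
# A cycle (0 beats 1, 1 beats 2, 2 beats 0): no Condorcet winner.
cycle = [[0, 2, 0],
         [0, 0, 2],
         [2, 0, 0]]
```

The Copeland fallback matters because noisy pairwise judgments can produce cycles like the second matrix, where no prompt beats all others.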
Original abstract

Large language models (LLMs) are highly sensitive to prompts, but most automatic prompt optimization (APO) methods assume access to ground-truth references (e.g., labeled validation data) that are costly to obtain. We propose the Prompt Duel Optimizer (PDO), a sample-efficient framework for label-free prompt optimization based on pairwise preference feedback from an LLM judge. PDO casts prompt selection as a dueling-bandit problem and combines (i) Double Thompson Sampling to prioritize informative comparisons under a fixed judge budget, with (ii) top-performer guided mutation to expand the candidate pool while pruning weak prompts. Experiments on BIG-bench Hard (BBH) and MS MARCO show that PDO consistently identifies stronger prompts than label-free baselines, while offering favorable quality-cost trade-offs under constrained comparison budgets.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The manuscript introduces the Prompt Duel Optimizer (PDO), a label-free prompt optimization framework that casts prompt selection as a dueling-bandit problem. It combines Double Thompson Sampling to prioritize informative pairwise comparisons under a fixed LLM-judge budget with top-performer guided mutation to expand and prune the candidate pool. Experiments on BIG-bench Hard (BBH) and MS MARCO report that PDO identifies stronger prompts than label-free baselines while achieving favorable quality-cost trade-offs under constrained comparison budgets.

Significance. If the experimental support is robust, the work could meaningfully advance automatic prompt optimization by removing dependence on ground-truth labels. The integration of established dueling-bandit algorithms with mutation operators provides a concrete, budget-aware alternative to existing label-free methods and may influence practical LLM deployment pipelines where labeled validation data are scarce.

major comments (2)
  1. [Experiments] Experiments section: the central claim that PDO 'consistently identifies stronger prompts' and offers favorable quality-cost trade-offs rests on LLM-judge pairwise preferences serving as a reliable proxy for downstream task performance, yet no correlation statistics, calibration plots, or ablations against ground-truth accuracy/relevance on BBH or MS MARCO are supplied; without these the label-free setting remains unverified.
  2. [Method and Experiments] Method and Experiments: the manuscript applies Double Thompson Sampling and mutation operators but provides no details on the number of independent runs, statistical significance tests, or error bars for the reported outperformance; these omissions make it impossible to assess whether the observed gains exceed baseline variance under the stated comparison budgets.
minor comments (2)
  1. [Abstract] Abstract: the phrase 'favorable quality-cost trade-offs' is used without defining the precise quality and cost metrics (e.g., exact judge score vs. token count) that are later compared.
  2. [Method] Notation: the description of 'top-performer guided mutation' would benefit from an explicit pseudocode listing or equation showing how the mutation operator is conditioned on the current top-ranked prompts.
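
For orientation, here is a hypothetical sketch of what such a listing might look like, based only on the Figure 8 caption (select the top-3 prompts by Copeland score, generate 10 new prompts, prune the 10 lowest-ranked). It is not the authors' pseudocode. `mutate_fn` is a placeholder for whatever LLM call rewrites a parent prompt, and the round-robin choice of parent is an assumption.

```python
def mutate_round(pool, scores, mutate_fn, n_parents=3, n_new=10, n_prune=10):
    """One expand-and-prune step over a list of prompt strings.

    pool:      candidate prompts
    scores:    Copeland score for each prompt, same order as pool
    mutate_fn: hypothetical callable mapping a parent prompt to a new one
    """
    # Rank prompt indices from highest to lowest score.
    ranked = sorted(range(len(pool)), key=lambda i: scores[i], reverse=True)
    parents = [pool[i] for i in ranked[:n_parents]]
    # Drop the n_prune lowest-ranked prompts (never a negative slice).
    keep = max(len(pool) - n_prune, 0)
    survivors = [pool[i] for i in ranked[:keep]]
    # Assumption: cycle through the parents round-robin.
    children = [mutate_fn(parents[c % len(parents)]) for c in range(n_new)]
    return survivors + children


# Toy usage: 20 prompts scored 0..19, with a stub mutation that appends "*".
pool = [f"p{i}" for i in range(20)]
scores = list(range(20))
new_pool = mutate_round(pool, scores, lambda p: p + "*")
```

With the defaults, a pool of 20 stays at 20 after each mutation round, which keeps the number of candidate pairs stable.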

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback and the recommendation for major revision. We address each major comment below and commit to revisions that strengthen the experimental validation and reporting without altering the core contributions of the label-free optimization framework.

point-by-point responses
  1. Referee: [Experiments] Experiments section: the central claim that PDO 'consistently identifies stronger prompts' and offers favorable quality-cost trade-offs rests on LLM-judge pairwise preferences serving as a reliable proxy for downstream task performance, yet no correlation statistics, calibration plots, or ablations against ground-truth accuracy/relevance on BBH or MS MARCO are supplied; without these the label-free setting remains unverified.

    Authors: We appreciate the referee's emphasis on verifying the LLM-judge proxy. While PDO performs optimization using only pairwise preferences, the final selected prompts are evaluated against ground-truth labels on BBH (accuracy) and MS MARCO (relevance) to quantify actual gains over baselines. To directly address this point, the revised manuscript will add Spearman rank correlations between aggregated judge preference scores and ground-truth metrics, calibration plots of judge reliability, and ablations showing alignment between judge-guided rankings and true performance under the fixed comparison budgets. These additions will be placed in the Experiments section to confirm the proxy's effectiveness. revision: yes

  2. Referee: [Method and Experiments] Method and Experiments: the manuscript applies Double Thompson Sampling and mutation operators but provides no details on the number of independent runs, statistical significance tests, or error bars for the reported outperformance; these omissions make it impossible to assess whether the observed gains exceed baseline variance under the stated comparison budgets.

    Authors: We agree that these experimental details are essential for assessing robustness. The revised version will explicitly state that all results are averaged over 5 independent runs with different random seeds, include standard deviation error bars in figures and tables, and report statistical significance via paired t-tests (or Wilcoxon signed-rank tests where appropriate) with p-values to show that gains exceed baseline variance under the constrained budgets. revision: yes
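
The reporting promised here (means, standard deviations, and a paired test across seeds) can be sketched with the standard library alone. The scores below are made-up placeholders, not results from the paper, and `summarize` and `paired_t` are names introduced for this example. Turning the statistic into a p-value needs the t-distribution CDF, for which a library routine such as `scipy.stats.ttest_rel` (or `scipy.stats.wilcoxon` for the signed-rank variant) would typically be used.

```python
import math
import statistics


def summarize(scores):
    """Mean and sample standard deviation across independent runs."""
    return statistics.mean(scores), statistics.stdev(scores)


def paired_t(a, b):
    """Paired t statistic and degrees of freedom for matched runs.

    a[i] and b[i] must come from the same seed or split. Raises
    ZeroDivisionError if all paired differences are identical.
    """
    diffs = [x - y for x, y in zip(a, b)]
    n = len(diffs)
    t = statistics.mean(diffs) / (statistics.stdev(diffs) / math.sqrt(n))
    return t, n - 1


# Made-up accuracies over 5 seeds (illustration only).
method_runs = [0.80, 0.82, 0.78, 0.81, 0.79]
baseline_runs = [0.75, 0.76, 0.77, 0.74, 0.78]
mean, std = summarize(method_runs)
t_stat, dof = paired_t(method_runs, baseline_runs)
```

Pairing by seed removes between-seed variance from the comparison, which matters when only 5 runs are available.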

Circularity Check

0 steps flagged

No circularity: framework applies established dueling-bandit methods without self-referential reduction.

full rationale

The paper casts prompt selection as a dueling-bandit problem and invokes Double Thompson Sampling plus mutation operators drawn from prior literature. No equations or steps reduce the claimed performance gains to a quantity defined by a fitted parameter from the same data, nor does any load-bearing premise rest solely on self-citation. The derivation remains self-contained against external benchmarks (BBH, MS MARCO) and does not exhibit any of the enumerated circular patterns.

Axiom & Free-Parameter Ledger

2 free parameters · 1 axiom · 0 invented entities

The central claim rests on the assumption that LLM pairwise judgments are informative for prompt ranking and on standard bandit convergence properties; free parameters include the comparison budget and any mutation-rate or sampling hyperparameters that are not detailed in the abstract.

free parameters (2)
  • comparison budget
    Fixed number of LLM judge calls is a core resource constraint in the dueling-bandit formulation.
  • mutation and sampling hyperparameters
    Parameters controlling how top prompts are mutated and how Thompson Sampling is tuned are required for the algorithm but not specified in the abstract.
axioms (1)
  • domain assumption: LLM judge pairwise preferences correlate with actual task performance
    The optimization loop depends on the judge providing useful ranking signals without ground truth.

pith-pipeline@v0.9.0 · 5686 in / 1379 out tokens · 47481 ms · 2026-05-18T06:58:09.914152+00:00 · methodology

Forward citations

Cited by 1 Pith paper

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

  1. Embedding by Elicitation: Dynamic Representations for Bayesian Optimization of System Prompts

    cs.AI 2026-05 unverdicted novelty 7.0

    ReElicit uses LLMs to elicit adaptive feature embeddings for Gaussian process Bayesian optimization of system prompts under aggregate-only feedback, outperforming baselines across ten tasks with a 30-evaluation budget.
