Best-of-Both-Worlds Multi-Dueling Bandits: Unified Algorithms for Stochastic and Adversarial Preferences under Condorcet and Borda Objectives

Jawar Singh; Pratik Gajane; S Akash

arxiv: 2603.18972 · v3 · pith:37HCCE5Pnew · submitted 2026-03-19 · 💻 cs.LG

Best-of-Both-Worlds Multi-Dueling Bandits: Unified Algorithms for Stochastic and Adversarial Preferences under Condorcet and Borda Objectives

S Akash , Pratik Gajane , Jawar Singh This is my paper

Pith reviewed 2026-05-21 11:05 UTC · model grok-4.3

classification 💻 cs.LG

keywords multi-dueling banditsbest-of-both-worldsCondorcet winnerBorda scorepseudo-regretstochastic preferencesadversarial preferences

0 comments

The pith

A single algorithm achieves optimal regret in both stochastic and adversarial multi-dueling bandits without knowing the regime in advance.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper shows that multi-dueling bandits can be solved optimally whether preferences arrive from a fixed stochastic distribution or from an adversary. It introduces MetaDueling as a black-box reduction that turns any dueling-bandit algorithm into a multi-dueling algorithm by converting the single winner observation into an unbiased pairwise preference signal. When the reduction is paired with an existing versatile base algorithm, it simultaneously attains the adversarial rate of O(sqrt(KT)) and the instance-optimal stochastic rate of sum log T over Delta_i. A second algorithm, SA-MiDEX, supplies matching-style guarantees for the Borda objective. Matching lower bounds are proved for the Condorcet case.

Core claim

The authors establish that MetaDueling converts any dueling bandit algorithm into a multi-dueling algorithm by transforming multi-way winner feedback into an unbiased pairwise signal, and when instantiated with Versatile-DB it simultaneously achieves O(sqrt(KT)) pseudo-regret against adversarial preferences and the instance-optimal O(sum_{i ≠ a*} log T / Delta_i) pseudo-regret under stochastic preferences for the Condorcet objective; for the Borda objective, SA-MiDEX achieves O(K^2 log KT + K log^2 T + sum K log KT / (Delta_i^B)^2) regret in stochastic environments and O(K sqrt(T log KT) + K^{1/3} T^{2/3} (log K)^{1/3}) regret against adversaries, all without prior knowledge of the regime.

What carries the argument

The MetaDueling black-box reduction that transforms multi-way winner feedback into an unbiased pairwise signal for use by an underlying dueling bandit algorithm.

If this is right

The first best-of-both-worlds algorithm exists for multi-dueling bandits under the Condorcet winner assumption.
O(sqrt(KT)) pseudo-regret holds against adversarial preferences while instance-optimal logarithmic regret holds under stochastic preferences at the same time.
The Borda version supplies near-optimal upper bounds that are within a factor K of known lower bounds.
Matching lower bounds are established for the Condorcet setting.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The same reduction technique could be applied to other partial-feedback online learning problems to obtain best-of-both-worlds guarantees.
Recommendation and ranking systems could deploy the algorithm when user preferences mix stochastic and adversarial elements without needing to detect the mixture.
Removing the extra K factor in the Borda bounds or extending the approach to other scoring rules would be natural next steps.

Load-bearing premise

Transforming the multi-way winner feedback into an unbiased pairwise signal preserves the regret guarantees of the base dueling-bandit algorithm without adding bias or extra logarithmic factors.

What would settle it

Run MetaDueling on a stochastic multi-dueling instance whose gaps Delta_i are known in advance and check whether the observed pseudo-regret stays within a constant factor of sum log T / Delta_i; systematic exceedance would falsify the claim.

read the original abstract

Multi-dueling bandits, where a learner selects $m \geq 2$ arms per round and observes only the winner, arise naturally in many applications including ranking and recommendation systems, yet a fundamental question has remained open: can a single algorithm perform optimally in both stochastic and adversarial environments, without knowing which regime it faces? We answer this affirmatively, providing the first best-of-both-worlds algorithms for multi-dueling bandits under both Condorcet and Borda objectives. For the Condorcet setting, we propose $\texttt{MetaDueling}$, a black-box reduction that converts any dueling bandit algorithm into a multi-dueling bandit algorithm by transforming multi-way winner feedback into an unbiased pairwise signal. Instantiating our reduction with $\texttt{Versatile-DB}$ yields the first best-of-both-worlds algorithm for multi-dueling bandits: it achieves $O(\sqrt{KT})$ pseudo-regret against adversarial preferences and the instance-optimal $O\left(\sum_{i \neq a^\star} \frac{\log T}{\Delta_i}\right)$ pseudo-regret under stochastic preferences, both simultaneously and without prior knowledge of the regime. For the Borda setting, we propose $\texttt{SA-MiDEX}$, a stochastic-and-adversarial algorithm that achieves $O\left(K^2 \log KT + K \log^2 T + \sum_{i: \Delta_i^{\mathrm{B}} > 0} \frac{K\log KT}{(\Delta_i^{\mathrm{B}})^2}\right)$ regret in stochastic environments and $O\left(K \sqrt{T \log KT} + K^{1/3} T^{2/3} (\log K)^{1/3}\right)$ regret against adversaries, again without prior knowledge of the regime. We complement our upper bounds with matching lower bounds for the Condorcet setting. For the Borda setting, our upper bounds are near-optimal with respect to the lower bounds (within a factor of $K$) and match the best-known results in the literature.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note private letter to a colleague

This paper gives the first best-of-both-worlds algorithms for multi-dueling bandits under both Condorcet and Borda, via a black-box reduction and a dedicated SA-MiDEX algorithm.

read the letter

The main takeaway is that the authors close an open question by giving algorithms that hit the right regret rates in stochastic and adversarial multi-dueling settings without knowing which regime is present. For Condorcet they use MetaDueling, a reduction that feeds unbiased pairwise signals from m-way winner feedback into an existing dueling algorithm like Versatile-DB. This produces the O(sqrt(KT)) adversarial bound alongside the instance-optimal stochastic bound. For Borda they introduce SA-MiDEX with its stated rates that adapt across regimes, plus matching lower bounds for the Condorcet case and near-optimal upper bounds for Borda within a K factor of known limits. The reduction and the explicit construction for the second objective are the concrete advances here. Prior single-regime results do not directly yield this unified performance, so the work fills a gap in the literature on robust preference learning. The soft spot is the reduction step itself. The claim that the pairwise estimator preserves exact instance optimality rests on the signals being unbiased with variance that does not grow with m or introduce round-to-round dependence. If the proofs only show unbiasedness without controlling the extra variance or correlation terms, the stochastic bound could pick up logarithmic factors and lose its instance-dependent character while still satisfying the adversarial bound. That detail needs verification in the full analysis. This is for people working on bandits and online ranking systems. Readers who follow best-of-both-worlds results or multi-dueling extensions will get direct value from the algorithms and bounds. It deserves peer review because the problem is substantive and the approach is concrete enough for referees to check the estimator properties and the Borda derivations.

Referee Report

2 major / 2 minor

Summary. The paper claims to resolve an open question in multi-dueling bandits by providing the first best-of-both-worlds algorithms that achieve optimal regret simultaneously in stochastic and adversarial regimes without prior knowledge of the environment. For the Condorcet objective, MetaDueling is presented as a black-box reduction that converts any dueling-bandit algorithm into a multi-dueling algorithm via an unbiased pairwise signal derived from m-way winner feedback; instantiating it with Versatile-DB yields O(√KT) adversarial pseudo-regret and the instance-optimal O(∑_{i≠a*} log T / Δ_i) stochastic pseudo-regret. For the Borda objective, SA-MiDEX is proposed with explicit stochastic and adversarial regret bounds that are near-optimal (within a K factor of known lower bounds). Matching lower bounds are given for the Condorcet case.

Significance. If the reduction is shown to preserve exact regret guarantees without extra logarithmic or m-dependent factors, the results would be significant: they unify stochastic and adversarial analyses for multi-dueling bandits under both Condorcet and Borda objectives, provide the first such BoBW algorithm for the Condorcet case, and supply matching lower bounds. The black-box nature and explicit instantiation with Versatile-DB are strengths that could allow reuse of future dueling-bandit advances.

major comments (2)

[MetaDueling reduction] MetaDueling reduction (Section 3 and associated analysis): the claim that the unbiased pairwise estimator constructed from m-way winner feedback preserves the exact instance-optimal stochastic regret of Versatile-DB requires explicit variance and dependence bounds. If the estimator compares the observed winner against the remaining m-1 arms (or uses random pairing), the per-pair sample size effectively drops by a factor of m and signals become correlated across rounds; this can inflate the variance term in the gap-based analysis and replace the claimed O(∑ log T / Δ_i) with an extra log(KT) or log m factor, undermining instance optimality while still permitting the O(√KT) adversarial bound.
[Theorem on stochastic regret] Theorem stating the stochastic regret of the instantiated algorithm: the proof must verify that the reduction does not alter the concentration or gap arguments used by Versatile-DB. Without a lemma bounding the variance of the synthetic pairwise probabilities independently of m, the instance-optimal claim remains unverified and is load-bearing for the central best-of-both-worlds result.

minor comments (2)

[Abstract] The abstract states that the Borda upper bounds are 'near-optimal with respect to the lower bounds (within a factor of K)'; clarify whether this factor is unavoidable or an artifact of the analysis.
[Abstract] Notation for the Borda gaps Δ_i^B should be defined before the regret expression in the abstract to improve readability.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for their careful reading, positive assessment of the significance of our results, and constructive feedback on the MetaDueling reduction. We address the two major comments point by point below. We agree that making the variance analysis fully explicit will improve the manuscript and will add the requested lemma and verification steps in the revision.

read point-by-point responses

Referee: [MetaDueling reduction] MetaDueling reduction (Section 3 and associated analysis): the claim that the unbiased pairwise estimator constructed from m-way winner feedback preserves the exact instance-optimal stochastic regret of Versatile-DB requires explicit variance and dependence bounds. If the estimator compares the observed winner against the remaining m-1 arms (or uses random pairing), the per-pair sample size effectively drops by a factor of m and signals become correlated across rounds; this can inflate the variance term in the gap-based analysis and replace the claimed O(∑ log T / Δ_i) with an extra log(KT) or log m factor, undermining instance optimality while still permitting the O(√KT) adversarial bound.

Authors: We appreciate this precise concern regarding variance inflation and dependence. Our construction of the unbiased pairwise signal (detailed in Section 3) generates, for every pair involving the observed winner, an indicator that is exactly Bernoulli with parameter equal to the true pairwise preference probability; consequently its variance is at most 1/4 independently of m. Although pairs observed in the same round are dependent, the overall sequence remains a martingale with respect to the filtration of past rounds, allowing the same Freedman's inequality and gap-based analysis used by Versatile-DB to apply directly. The effective number of samples per pair is not reduced by a factor of m because each round contributes an unbiased observation to every pair that includes the winner. To make these facts fully explicit and rule out hidden logarithmic factors, we will insert a new Lemma 3.2 that states and proves the variance bound together with a short argument showing that the instance-dependent stochastic regret of Versatile-DB carries over unchanged. This addition will be included in the revised version. revision: yes
Referee: [Theorem on stochastic regret] Theorem stating the stochastic regret of the instantiated algorithm: the proof must verify that the reduction does not alter the concentration or gap arguments used by Versatile-DB. Without a lemma bounding the variance of the synthetic pairwise probabilities independently of m, the instance-optimal claim remains unverified and is load-bearing for the central best-of-both-worlds result.

Authors: We agree that the current sketch in the proof of Theorem 4.1 relies on the reader recognizing that the estimator is unbiased and bounded. To verify that neither the concentration inequalities nor the gap arguments are altered, we will add an explicit supporting lemma (placed immediately before Theorem 4.1) that (i) bounds the variance of each synthetic pairwise probability by 1/4 for any m, (ii) confirms that the effective gaps Δ_i remain identical, and (iii) shows that the number of effective observations per pair is preserved up to a universal constant. With this lemma the application of Versatile-DB's analysis becomes fully rigorous and the claimed O(∑_{i≠a*} log T / Δ_i) bound holds without extra factors. The revised manuscript will contain this verification. revision: yes

Circularity Check

0 steps flagged

No significant circularity in derivation chain

full rationale

The paper's central contribution is a black-box reduction (MetaDueling) that converts multi-dueling winner feedback into unbiased pairwise signals for use with an external dueling-bandit algorithm (Versatile-DB). The claimed regret bounds are inherited from the base algorithm's guarantees plus the properties of the transformation; neither the stochastic instance-optimal bound nor the adversarial O(√KT) bound is obtained by fitting parameters to the target data or by redefining the output in terms of itself. No self-definitional steps, fitted-input predictions, or load-bearing self-citations appear in the abstract or claimed derivation. The construction is therefore self-contained against external benchmarks and receives the default non-circularity finding.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axioms · 0 invented entities

Based on abstract only; standard multi-armed bandit assumptions such as bounded rewards and existence of a Condorcet winner or Borda scores are implicit but not detailed.

axioms (1)

domain assumption Existence of a Condorcet winner or well-defined Borda scores in the preference model
Required for the instance-optimal stochastic regret expressions to be meaningful.

pith-pipeline@v0.9.0 · 5928 in / 1271 out tokens · 46362 ms · 2026-05-21T11:05:18.180549+00:00 · methodology

discussion (0)

Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

IndisputableMonolith/Foundation/AbsoluteFloorClosure.lean reality_from_one_distinction unclear

?

unclear
Relation between the paper passage and the cited Recognition theorem.

MetaDueling, a black-box reduction that converts any dueling bandit algorithm into a multi-dueling bandit algorithm by transforming multi-way winner feedback into an unbiased pairwise signal.
IndisputableMonolith/Cost/FunctionalEquation.lean washburn_uniqueness_aczel unclear

?

unclear
Relation between the paper passage and the cited Recognition theorem.

O(√KT) pseudo-regret against adversarial preferences and the instance-optimal O(∑ log T / Δ_i) pseudo-regret under stochastic preferences

What do these tags mean?

matches: The paper's claim is directly supported by a theorem in the formal canon.
supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses: The paper appears to rely on the theorem as machinery.
contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.