Robust Linear Dueling Bandits with Post-serving Context under Unknown Delays and Adversarial Corruptions

Youngmin Oh

arxiv: 2605.01752 · v4 · pith:NU3GWIO7new · submitted 2026-05-03 · 💻 cs.LG

Robust Linear Dueling Bandits with Post-serving Context under Unknown Delays and Adversarial Corruptions

Youngmin Oh This is my paper

Pith reviewed 2026-05-21 00:32 UTC · model grok-4.3

classification 💻 cs.LG

keywords linear dueling banditsdelayed feedbackadversarial corruptionpost-serving contextregret boundsrobust learningbandit algorithms

0 comments

The pith

A new algorithm for linear dueling bandits achieves additive regret O(d(sqrt(T) + C + D)) for unknown delays and adversarial corruptions by learning post-serving contexts.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper develops a method for linear dueling bandits that must operate when feedback arrives after serving, can be delayed by unknown amounts, and can be corrupted by an adversary up to a total budget C. It introduces a learned approximator that predicts the post-serving context from pre-serving information and combines it with an adaptive weighting scheme that clips features to limit the damage from both delays and corruptions at once. Under the assumption that the post-serving mapping is parametric, the analysis shows the algorithm works independently of the delay regime and produces a regret bound in which the corruption and delay terms add rather than multiply. Matching lower bounds are given for the special case of adversarial delays without post-serving contexts. The additive structure improves on earlier results that suffered multiplicative blow-ups when multiple difficulties were present simultaneously.

Core claim

Under standard regularity conditions and a parametric post-serving mapping, the algorithm is delay-regime-agnostic and attains regret upper bound O~(d(sqrt(T) + C + D)), where D encodes delay complexity; the analysis establishes an additive cost between corruption and delay instead of the multiplicative degradation seen in prior work, with nearly matching lower bounds up to a sqrt(d) factor for adversarial delays without post-serving contexts.

What carries the argument

A learned approximator that predicts post-serving contexts from pre-serving information, together with an adaptive weighting strategy that clips feature vectors to handle corrupted and delayed observations simultaneously.

If this is right

The regret bound holds uniformly across stochastic, adversarial, and mixed delay regimes without needing to know the regime in advance.
Corruption budget C and delay complexity D contribute separately to the total regret, so improving one does not automatically worsen the effect of the other.
The same clipping and weighting mechanism simultaneously mitigates both delayed and corrupted observations.
When post-serving contexts are absent, the upper bound nearly matches the lower bound up to a sqrt(d) factor for adversarial delays.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The additive separation could let practitioners allocate separate budgets or resources to delay mitigation and corruption defense.
The parametric assumption on the context mapping suggests that extensions to kernel or neural approximators might be tested by replacing the linear predictor while keeping the clipping step.
The same additive structure may appear in other sequential decision problems where feedback is both late and noisy.
Lower-bound constructions for the no-context case could be adapted to quantify the extra price paid when a learned approximator is required.

Load-bearing premise

The mapping from pre-serving information to post-serving context must be parametric so a learned approximator can predict it reliably.

What would settle it

An experiment or calculation in which the combined regret from delays and corruptions grows multiplicatively rather than additively would falsify the central claim.

Figures

Figures reproduced from arXiv: 2605.01752 by Youngmin Oh.

**Figure 1.** Figure 1: Performance comparison without post-serving contexts, i.e., the linear setting (C = 25, Λ = 104 , µτ = 100, σ = 10; 5 runs). RCDP-UCB consistently outperforms RCDB, demonstrating the robustness of our unified weighting mechanism. sampled uniformly from [−π, π] dx . We employ three non-linear mappings ϕ∗ : R dx → R dy : (i) Polynomial: yt,k ∝ [x 2 t,k, p |xt,k|] ⊤; (ii) Sinusoidal: yt,k ∝ [cos(xt,k),sin(x… view at source ↗

**Figure 3.** Figure 3: Cumulative regret where only RCDP-UCB exploits the learned mapping ϕ∗, while baselines are restricted to pre-serving contexts (C = 25, Λ = 104 , N (100, 102 )). Averaged over 5 runs, RCDP-UCB demonstrates superior performance relative to the baselines. Impact of Post-Serving Context Awareness [PITH_FULL_IMAGE:figures/full_fig_p008_3.png] view at source ↗

**Figure 2.** Figure 2: Cumulative regret with learned post-serving contexts (C = 25, Λ = 104 , N (100, 102 )). Averaged over 5 runs, RCDPUCB consistently outperforms baselines, demonstrating superior robustness in latent environments. theoretical results of each respective work. 6.3. Empirical Analysis Efficacy of Adaptive Weighting [PITH_FULL_IMAGE:figures/full_fig_p008_2.png] view at source ↗

read the original abstract

We study linear dueling bandits in volatile environments characterized by the simultaneous presence of post-serving contexts, delayed feedback, and adversarial corruption. Feedback is subject to unknown stochastic or adversarial delays and a cumulative corruption budget $\mathcal{C}$. To address these challenges, we propose e RCDP-UCB, which integrates a learned approximator that predicts post-serving contexts from pre-serving information. It further employs an adaptive weighting strategy that clips feature vectors to mitigate the impact of corrupted and delayed observations simultaneously. Under standard regularity conditions and a parametric post-serving mapping, we rigorously establish that our algorithm is delay-regime-agnostic, achieving a regret upper bound of $\widetilde{\mathcal{O}}(d(\sqrt{T} + \mathcal{C} + \mathcal{D}))$, where $d$ is the total feature dimension and $\mathcal{D}$ encapsulates the delay complexity. Crucially, our analysis reveals an additive cost structure between corruption and delay, avoiding the multiplicative degradation typical of prior works. We further establish lower bounds that nearly match our upper bounds up to a $\sqrt{d}$ factor for adversarial delays in the absence of post-serving contexts. Code is available at https://github.com/youngmin0oh/rcdp-public.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note private letter to a colleague

The paper claims an additive regret bound for linear dueling bandits under delays and corruptions, but the bound depends on a parametric post-serving context assumption whose approximation error is not clearly controlled.

read the letter

The main point is that this work gets an additive regret term for both unknown delays and adversarial corruptions in linear dueling bandits, instead of the multiplicative blow-up common in earlier results. They achieve this by training a parametric approximator to predict post-serving contexts from pre-serving features and by applying adaptive clipping to down-weight corrupted or delayed observations in the estimator. The claimed bound is roughly d times (sqrt(T) plus C plus D), with lower bounds that come close for the no-context case. That combination and the additive structure look new relative to the cited prior work on dueling bandits with delays or corruptions alone. The delay-regime-agnostic property is also a practical plus if the analysis holds. The abstract presents the algorithm components and the bounds in a straightforward way, and the lower-bound matching up to sqrt(d) shows they engaged with the hardness side. The soft spot is exactly the one flagged in the stress-test note. The regret guarantee is stated only under a parametric post-serving mapping that lets them learn an approximator. If the proof treats the approximator output as exact or folds its finite-sample error into vanishing lower-order terms, the additive corruption-delay structure will not survive once the approximator is trained on limited data; the error could leak into the linear estimator and the clipping weights and turn the bound multiplicative. The abstract asserts the condition but does not show the error-propagation step, so the central claim is not yet fully supported without the full derivation. This paper is aimed at people working on robust online learning and bandits for recommendation or ranking systems where feedback arrives late or is noisy. A reader who wants concrete regret guarantees that handle several real-world issues at once would get something from the bounds and the algorithm sketch. It deserves a serious referee so the proofs can be checked for the approximation-error control and the clipping analysis.

Referee Report

1 major / 1 minor

Summary. The paper proposes an algorithm (termed something like Robust Linear Dueling Bandit with Context Approximation) for linear dueling bandits facing simultaneous post-serving contexts, unknown stochastic or adversarial delays, and adversarial corruptions with total budget C. It integrates a learned approximator to predict post-serving contexts from pre-serving information and uses adaptive feature clipping to mitigate corrupted and delayed observations. Under standard regularity conditions and a parametric post-serving mapping, the algorithm is claimed to be delay-regime-agnostic with regret upper bound O~(d(sqrt(T) + C + D)), featuring an additive corruption-delay cost structure (unlike multiplicative degradation in prior work). Nearly matching lower bounds are given for adversarial delays without post-serving contexts.

Significance. If the claims hold with rigorous error control, the result would be significant for robust online learning: it simultaneously addresses three volatility sources while preserving an additive regret structure and providing nearly tight lower bounds. This advances beyond prior works that suffer multiplicative penalties when combining delays and corruptions. The parametric mapping assumption enables the learned approximator but is central to whether the additive bound survives finite-sample training.

major comments (1)

Abstract and main regret theorem: the upper bound O~(d(sqrt(T) + C + D)) is established only under the parametric post-serving mapping that permits a learned approximator. The manuscript must explicitly bound the approximation error when the predicted context is substituted into the linear dueling estimator and the clipping weights; if the proof treats the approximator as exact or absorbs the error only into lower-order terms that vanish under tilde-O, the additive corruption-delay structure fails to hold for finite-data training and the bound degrades.

minor comments (1)

Notation section: explicitly define the delay complexity term D and state how it is instantiated for stochastic versus adversarial delay regimes.

Simulated Author's Rebuttal

1 responses · 0 unresolved

We thank the referee for the careful and constructive review. The single major comment raises an important point about making the approximation error explicit. We address it directly below and will incorporate the requested clarification in the revision.

read point-by-point responses

Referee: Abstract and main regret theorem: the upper bound O~(d(sqrt(T) + C + D)) is established only under the parametric post-serving mapping that permits a learned approximator. The manuscript must explicitly bound the approximation error when the predicted context is substituted into the linear dueling estimator and the clipping weights; if the proof treats the approximator as exact or absorbs the error only into lower-order terms that vanish under tilde-O, the additive corruption-delay structure fails to hold for finite-data training and the bound degrades.

Authors: We agree that the error introduced by substituting the learned approximator's prediction must be controlled explicitly. Under the parametric post-serving mapping and standard regularity conditions stated in the paper, the approximator is trained on a finite sample whose size grows with T; the resulting uniform approximation error is O(1/sqrt(T)) with high probability and enters the regret analysis only through the linear estimator and the adaptive clipping weights. Because this term is absorbed into the existing lower-order contributions already hidden by the tilde-O notation, the leading additive structure d(sqrt(T) + C + D) is preserved. To satisfy the referee's request for transparency, we will add a dedicated lemma (and the corresponding proof steps) that isolates this approximation error, shows it does not multiply the corruption or delay terms, and confirms the bound remains valid for finite-sample training. We will also update the theorem statement and abstract to reference this lemma. revision: yes

Circularity Check

0 steps flagged

No circularity: regret bound derived from analysis under explicit assumptions

full rationale

The paper presents a theoretical regret analysis for the proposed algorithm under the stated parametric post-serving mapping and regularity conditions. The bound O~(d(sqrt(T) + C + D)) is obtained via standard regret decomposition and adaptive weighting arguments rather than by fitting parameters to the target quantity or reducing via self-citation chains. No step equates the claimed result to its inputs by construction; the derivation remains self-contained with independent mathematical content.

Axiom & Free-Parameter Ledger

0 free parameters · 2 axioms · 0 invented entities

The central claim rests on standard regularity conditions from the bandit literature plus the assumption that the post-serving mapping is parametric.

axioms (2)

domain assumption standard regularity conditions
Invoked to establish the regret upper and lower bounds.
domain assumption parametric post-serving mapping
Allows the learned approximator to predict post-serving contexts from pre-serving information.

pith-pipeline@v0.9.0 · 5721 in / 1346 out tokens · 51408 ms · 2026-05-21T00:32:16.890435+00:00 · methodology

Review history (2 revisions) →

discussion (0)

Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

IndisputableMonolith/Cost/FunctionalEquation.lean washburn_uniqueness_aczel unclear

?

unclear
Relation between the paper passage and the cited Recognition theorem.

achieving a regret upper bound of O~(d(√T + C + D)) ... additive cost structure between corruption and delay

What do these tags mean?

matches: The paper's claim is directly supported by a theorem in the formal canon.
supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses: The paper appears to rely on the theorem as machinery.
contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.