Queue Length Regret Bounds for Contextual Queueing Bandits

Dabeen Lee; Garyeong Kang; Seoungbin Bae

arxiv: 2601.19300 · v2 · pith:KNUJXFVPnew · submitted 2026-01-27 · 💻 cs.LG

Queue Length Regret Bounds for Contextual Queueing Bandits

Seoungbin Bae , Garyeong Kang , Dabeen Lee This is my paper

Pith reviewed 2026-05-21 14:50 UTC · model grok-4.3

classification 💻 cs.LG

keywords contextual banditsqueueing banditsregret boundsschedulinglogistic regressioncoupling argumentqueue length

0 comments

The pith

Contextual queueing bandits achieve sublinear queue length regret bounds using policy-switching queues.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces contextual queueing bandits as a setting where jobs with contextual features are scheduled to servers with unknown logistic service rates to minimize queue lengths. It measures performance by the regret in queue length compared to an optimal policy that knows the rates. The analysis challenge is that policies can leave different sets of jobs in the queue, complicating direct comparisons. The authors address this with policy-switching queues and a coupling argument that decomposes regret into immediate and propagated effects, leading to a regret bound of order T to the power of negative one fourth for their algorithm.

Core claim

The paper claims that a novel regret decomposition framework based on policy-switching queues and coupling arguments allows bounding the queue length regret for contextual queueing bandits, with the CQB-ε algorithm achieving a bound of order T^{-1/4} and the CQB-Opt algorithm achieving O(log^2 T) under adversarial contexts.

What carries the argument

Policy-switching queues with a coupling argument that separates the short-term cost of suboptimal matches from their long-term impact on queue states.

Load-bearing premise

Service and departure rates follow a logistic model based on job contextual features and unknown server-specific parameters.

What would settle it

Simulations or real deployments where the measured queue length regret exceeds the predicted upper bounds of T to the power of one fourth despite the logistic rate model holding.

read the original abstract

We introduce contextual queueing bandits, a new context-aware framework for scheduling while simultaneously learning unknown service rates. Individual jobs carry heterogeneous contextual features, based on which the agent chooses a job and matches it with a server to maximize the departure rate. The service/departure rate is governed by a logistic model of the contextual feature with an unknown server-specific parameter. To evaluate the performance of a policy, we consider queue length regret, defined as the difference in queue length between the policy and the optimal policy. The main challenge in the analysis is that the lists of remaining job features in the queue may differ under our policy versus the optimal policy for a given time step, since they may process jobs in different orders. To address this, we propose the idea of policy-switching queues equipped with a sophisticated coupling argument. This leads to a novel queue length regret decomposition framework, allowing us to understand the short-term effect of choosing a suboptimal job-server pair and its long-term effect on queue state differences. We show that our algorithm, CQB-$\varepsilon$, achieves a regret upper bound of $\widetilde{\mathcal{O}}(T^{-1/4})$. We also consider the setting of adversarially chosen contexts, for which our second algorithm, CQB-Opt, achieves a regret upper bound of $\mathcal{O}(\log^2 T)$. Lastly, we provide experimental results that validate our theoretical findings.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note private letter to a colleague

The paper sets up contextual queueing bandits and uses a policy-switching coupling to decompose regret into short-term and long-term queue effects, but the resulting T^{-1/4} bound looks loose and the coupling may leave uncontrolled residuals.

read the letter

The main new piece is the contextual queueing bandits setup itself: jobs arrive with features, you pick which job to send to which server, and the departure rate follows a logistic model with unknown server parameters. Regret is measured directly by the difference in total queue length versus the optimal policy, not by cumulative reward. That metric forces them to track how different processing orders leave different feature lists in the queue at later times, which is the core technical difficulty they flag in the abstract. Their policy-switching queues plus coupling argument is a direct attempt to handle that dependence, and the resulting short-term versus long-term decomposition is a clean way to separate the cost of exploration from the downstream effect on queue state. The adversarial-context variant getting O(log^2 T) is also a reasonable extension. Those elements are the actual contribution; the rest is standard contextual bandit machinery wrapped around the queueing constraint. The T^{-1/4} rate for the stochastic case is the clearest soft spot. It is slower than typical contextual bandit rates, which is expected once queue dynamics are added, but it suggests the coupling may not be tight enough to absorb all the state-difference terms. The stress-test concern is worth checking in the proofs: if the coupling only matches marginal distributions step by step without controlling the ordering of distinct feature realizations, an additive term linear in the number of distinct contexts could appear in the long-term component and break the claimed bound. The logistic modeling assumption is another limitation, but it is stated up front so readers know the scope. Experiments are mentioned only at the abstract level, so their strength is unclear. This paper is for researchers working on online scheduling or resource allocation with learning. A reader who already knows contextual bandits and wants to see how queue-length dependence changes the analysis will find the decomposition useful. The work is coherent enough on its own terms to deserve a serious referee, even if the bounds end up needing improvement.

Referee Report

2 major / 2 minor

Summary. The paper introduces contextual queueing bandits, a setting in which jobs arrive with contextual features and must be matched to servers whose service rates follow an unknown logistic model. Performance is measured by queue-length regret (difference in total queue length versus the optimal policy). The main technical challenge is that policies may process jobs in different orders, causing the lists of remaining job features to diverge. The authors address this via a novel decomposition based on policy-switching queues and a coupling argument that separates short-term suboptimality from long-term state differences. They present algorithm CQB-ε, which achieves queue-length regret Õ(T^{-1/4}) under stochastic contexts, and algorithm CQB-Opt, which achieves O(log² T) regret under adversarial contexts. Experimental results are provided to support the theory.

Significance. If the claimed bounds are established without residual terms that grow faster than the target rate, the work would constitute a meaningful extension of contextual bandits to queueing systems with state-dependent feedback. The policy-switching queue decomposition offers a potentially reusable technique for regret analysis in settings where action order affects future state distributions. The adversarial-context result is particularly notable because it removes the usual stochasticity assumption on contexts while retaining a polylogarithmic bound.

major comments (2)

[Queue-length regret decomposition and coupling argument (analysis section following the algorithm definition)] The coupling argument for policy-switching queues (introduced to control divergence of remaining job-feature lists) is load-bearing for the Õ(T^{-1/4}) claim. It is unclear whether the coupling only matches marginal distributions at each step or also bounds the accumulated ordering differences across distinct feature realizations; if the latter produces an additive term linear in the number of distinct contexts, the long-term component of the decomposition would exceed the short-term exploration cost and invalidate the overall rate. A explicit bound on the residual queue-length difference after coupling (e.g., in the proof of the main regret theorem) is required.
[Proof of Theorem 1 (or equivalent statement of the Õ(T^{-1/4}) bound for CQB-ε)] The decomposition framework converts per-step suboptimality into a total queue-length difference of order T^{3/4}. The short-term term is controlled by standard contextual-bandit techniques, but the long-term term relies on the coupling to be o(T^{3/4}). If the coupling leaves an extra additive factor that scales with the number of distinct feature vectors observed up to time t, the claimed Õ(T^{-1/4}) average regret does not follow. The proof should state the precise order of the residual term after applying the coupling.

minor comments (2)

[Abstract] The abstract uses both Õ and O notation; clarify whether the Õ hides only logarithmic factors or additional problem-dependent constants.
[Model section] The logistic service-rate model is stated as the governing assumption; add a brief remark on what happens if the true departure probability deviates from the logistic form (e.g., whether the regret guarantees degrade gracefully).

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for their thorough review and for recognizing the potential significance of contextual queueing bandits and the policy-switching decomposition. We address the major comments point by point below and will revise the manuscript accordingly.

read point-by-point responses

Referee: [Queue-length regret decomposition and coupling argument (analysis section following the algorithm definition)] The coupling argument for policy-switching queues (introduced to control divergence of remaining job-feature lists) is load-bearing for the Õ(T^{-1/4}) claim. It is unclear whether the coupling only matches marginal distributions at each step or also bounds the accumulated ordering differences across distinct feature realizations; if the latter produces an additive term linear in the number of distinct contexts, the long-term component of the decomposition would exceed the short-term exploration cost and invalidate the overall rate. A explicit bound on the residual queue-length difference after coupling (e.g., in the proof of the main regret theorem) is required.

Authors: We thank the referee for this observation. The coupling is constructed to match marginal distributions of remaining features at each step while also controlling accumulated ordering differences via the limited policy switches that occur only during exploration. We agree that an explicit bound on the residual queue-length difference will improve clarity. In the revised manuscript we will add a supporting lemma in the analysis section that states this residual explicitly and confirms it remains o(T^{3/4}), preserving the claimed rate. revision: yes
Referee: [Proof of Theorem 1 (or equivalent statement of the Õ(T^{-1/4}) bound for CQB-ε)] The decomposition framework converts per-step suboptimality into a total queue-length difference of order T^{3/4}. The short-term term is controlled by standard contextual-bandit techniques, but the long-term term relies on the coupling to be o(T^{3/4}). If the coupling leaves an extra additive factor that scales with the number of distinct feature vectors observed up to time t, the claimed Õ(T^{-1/4}) average regret does not follow. The proof should state the precise order of the residual term after applying the coupling.

Authors: We agree that the proof of Theorem 1 would benefit from an explicit statement of the residual term's order. The long-term component is controlled by the coupling to be o(T^{3/4}) under the stochastic context assumption. In the revision we will state the precise order of this residual term directly in the proof, showing that any dependence on the number of observed distinct contexts remains sufficiently small to validate the overall Õ(T^{-1/4}) bound. revision: yes

Circularity Check

0 steps flagged

No significant circularity; derivation uses novel decomposition and coupling

full rationale

The paper's central contribution is a new queue length regret decomposition framework based on policy-switching queues and a coupling argument to control differences in remaining job feature lists between the policy and optimum. This is introduced to address the core challenge and enables the T^{-1/4} bound without reducing the final regret expression to a fitted parameter or self-citation by construction. The analysis is self-contained given the logistic service rate model; no load-bearing step equates the claimed bound to its inputs via definition or prior self-work.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axioms · 1 invented entities

The central claim depends on the logistic service rate model and the technical device of policy-switching queues to enable the coupling argument for regret analysis.

axioms (1)

domain assumption The service/departure rate is governed by a logistic model of the contextual feature with an unknown server-specific parameter.
This is the modeling choice that allows learning the rates and deriving the regret bounds, as stated in the abstract.

invented entities (1)

policy-switching queues no independent evidence
purpose: To facilitate a coupling argument that handles differences in remaining job features between the policy and the optimal policy.
Introduced specifically to address the main analytical challenge of differing queue states over time.

pith-pipeline@v0.9.0 · 5780 in / 1384 out tokens · 69988 ms · 2026-05-21T14:50:00.773852+00:00 · methodology

discussion (0)

Forward citations

Cited by 1 Pith paper

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

Logistic Bandits with $\tilde{O}(\sqrt{dT})$ Regret without Context Diversity Assumptions
cs.LG 2026-04 unverdicted novelty 8.0

SupSplitLog achieves Õ(√(dT)) regret for logistic bandits without context diversity assumptions by splitting samples for an initial estimator and Newton correction, and can adapt to data-dependent bounds.