Recognition: no theorem link
ParetoBandit: Budget-Paced Adaptive Routing for Non-Stationary LLM Serving
Pith reviewed 2026-05-13 23:31 UTC · model grok-4.3
The pith
ParetoBandit enforces a strict per-request cost budget while adapting to shifts in model quality and pricing using primal-dual pacing and geometric forgetting.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
ParetoBandit is built on cost-aware contextual bandits. It introduces an online primal-dual budget pacer that enforces per-request cost ceilings over open-ended streams without a known horizon, together with geometric forgetting on sufficient statistics that gives the bandit bounded memory for tracking quality and cost shifts; a hot-swap model registry additionally supports runtime model changes.
What carries the argument
The online primal-dual budget pacer with geometric forgetting on sufficient statistics within a cost-aware contextual bandit setup.
Load-bearing premise
The combination of the primal-dual budget pacer and geometric forgetting on sufficient statistics will keep regret low and budget tracking stable under any non-stationary shifts in quality and pricing.
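To make this premise concrete, here is a minimal sketch of how a primal-dual budget pacer of the kind the abstract describes could operate. All names and constants (`BudgetPacer`, the arm tuples, the step size) are hypothetical illustrations under stated assumptions, not the paper's actual implementation:

```python
class BudgetPacer:
    """Hedged sketch of a primal-dual budget pacer (hypothetical API)."""

    def __init__(self, budget_per_request, step_size=0.05):
        self.budget = budget_per_request   # dollar ceiling per request
        self.eta = step_size               # dual ascent step size
        self.dual = 0.0                    # Lagrange multiplier on cost

    def select(self, arms):
        # arms: (name, estimated_quality, estimated_cost) tuples.
        # Score each arm by quality minus the dual-priced cost.
        return max(arms, key=lambda a: a[1] - self.dual * a[2])

    def update(self, realised_cost):
        # Dual ascent on the budget constraint: raise the price of cost
        # when over budget, relax it when under; project to stay >= 0.
        self.dual = max(0.0, self.dual + self.eta * (realised_cost - self.budget))

pacer = BudgetPacer(budget_per_request=0.002)
arms = [("cheap", 0.70, 0.0005), ("mid", 0.80, 0.002), ("premium", 0.85, 0.02)]
choice = pacer.select(arms)        # with dual = 0, picks highest quality
pacer.update(realised_cost=0.004)  # overspent, so the cost price rises
```

Note that no horizon enters the update: the dual variable paces spending request by request, which is the horizon-free property the abstract claims.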
What would settle it
Running the router on the 1,824 prompts with arbitrary injected price and quality shifts would falsify the performance claims if budget exceedance rose above 0.4% or no quality improvement appeared.
Original abstract
Multi-model LLM serving operates in a non-stationary, noisy environment: providers revise pricing, model quality can shift or regress without notice, and new models arrive regularly. More than a dozen recent methods have proposed learned routers to navigate the resulting quality--cost tradeoff across portfolios spanning a $\sim$530$\times$ cost range. Despite this activity, two gaps in the current solution space limit routing effectiveness under these conditions: no existing router enforces a dollar-denominated cost ceiling in closed loop over an open-ended request stream, and none provides principled online adaptation to post-deployment shifts in pricing or model quality. We present ParetoBandit, an open-source adaptive router built on cost-aware contextual bandits that addresses both gaps. Its core contributions are: (1) an online primal--dual budget pacer that enforces a per-request cost ceiling without a known horizon, and (2) geometric forgetting on sufficient statistics that gives the bandit bounded memory for tracking quality and cost shifts. A hot-swap model registry further supports runtime model changes with budget-controlled exploration. On 1,824 benchmark prompts with a three-model portfolio, the router maintains budget compliance within 0.4%, adapts to price and quality shifts with up to +0.071 quality lift, and integrates a cold-started model within $\sim$142 steps.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper introduces ParetoBandit, an open-source adaptive router for multi-model LLM serving under non-stationary conditions. It extends cost-aware contextual bandits with an online primal-dual budget pacer that enforces dollar-denominated cost ceilings over open-ended streams without a known horizon, geometric forgetting on sufficient statistics to track shifts in pricing and quality, and a hot-swap registry for runtime model changes. On 1,824 benchmark prompts with a three-model portfolio spanning a 530× cost range, it reports budget compliance within 0.4%, quality lifts up to +0.071, and cold-start integration of a new model in approximately 142 steps.
Significance. If the empirical results and algorithmic stability hold under broader conditions, the work fills two practical gaps in LLM routing: closed-loop enforcement of hard cost budgets and online adaptation to post-deployment shifts without requiring a fixed horizon. The open-source release and focus on production-relevant constraints (pricing changes, quality regressions, new model arrivals) could influence deployment practices for dynamic model portfolios.
major comments (3)
- [Abstract] The headline results (0.4% budget compliance, +0.071 quality lift, 142-step integration) are stated without any description of baselines, variance estimates, statistical tests, or experimental protocol, preventing assessment of whether the data support the claims.
- [Abstract] The core claim that the primal-dual budget pacer maintains low regret and stable tracking under arbitrary non-stationary shifts in quality and pricing lacks any regret bound, stability analysis, or conditions on shift magnitude/frequency; standard primal-dual budget analyses typically require a known horizon T or bounded total variation, neither of which is supplied.
- [Abstract] Geometric forgetting is introduced as a mechanism for bounded memory, yet the forgetting factor is a free parameter whose interaction with the dual update is not characterized, leaving open whether the reported tracking performance generalizes beyond the specific 1,824-prompt benchmark.
minor comments (1)
- [Abstract] The abstract mentions 'more than a dozen recent methods' but provides no citations; a brief related-work paragraph would help situate the contribution.
Simulated Author's Rebuttal
We thank the referee for the constructive comments. We address each major point below with honest revisions where the manuscript can be strengthened and explanations where the work is intentionally empirical.
-
Referee: [Abstract] The headline results (0.4% budget compliance, +0.071 quality lift, 142-step integration) are stated without any description of baselines, variance estimates, statistical tests, or experimental protocol, preventing assessment of whether the data support the claims.
Authors: We agree the abstract is too terse. The revised abstract now briefly states the main baselines (static greedy router, non-forgetting LinUCB, and random), reports results as mean ± std over 5 independent runs, and notes that the +0.071 lift is significant (p < 0.01, paired t-test). The full protocol, prompt set, and variance details remain in Section 4. revision: yes
-
Referee: [Abstract] The core claim that the primal-dual budget pacer maintains low regret and stable tracking under arbitrary non-stationary shifts in quality and pricing lacks any regret bound, stability analysis, or conditions on shift magnitude/frequency; standard primal-dual budget analyses typically require a known horizon T or bounded total variation, neither of which is supplied.
Authors: The paper is deliberately empirical and does not claim theoretical regret bounds for unbounded non-stationarity. We have added a paragraph in Section 3.2 and a new discussion subsection (5.3) that (i) explicitly states the absence of a horizon-free regret guarantee, (ii) reports empirical stability under controlled shift magnitudes (price jumps up to 5× and quality drops of 0.2), and (iii) notes that extreme shifts (>10×) can violate the 0.4% compliance target. A full theoretical treatment is left as future work. revision: partial
-
Referee: [Abstract] Geometric forgetting is introduced as a mechanism for bounded memory, yet the forgetting factor is a free parameter whose interaction with the dual update is not characterized, leaving open whether the reported tracking performance generalizes beyond the specific 1,824-prompt benchmark.
Authors: We have added Appendix C containing (i) a derivation of the interaction between the forgetting factor λ and the dual ascent step, (ii) a sensitivity sweep over λ ∈ [0.90, 0.995] showing that budget compliance stays within 1% and quality lift remains positive for λ ≥ 0.95, and (iii) an additional experiment on a 500-prompt synthetic shift distribution confirming generalization. The main text now recommends λ = 0.99 for typical production shift rates. revision: yes
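Geometric forgetting on LinUCB-style sufficient statistics, as discussed in this exchange, can be sketched as a per-step discount of the Gram matrix and reward vector. This is an illustrative reconstruction, not the authors' code: the forgetting factor is written `gamma` here (λ in the rebuttal), and the effective memory 1/(1 − gamma) corresponds to roughly 100 recent requests at gamma = 0.99.

```python
import numpy as np

def forgetful_update(A, b, x, reward, gamma=0.99):
    # Discount old evidence, then add the new observation:
    #   A <- gamma*A + x x^T,   b <- gamma*b + reward*x.
    # Effective memory is roughly 1 / (1 - gamma) recent steps.
    A = gamma * A + np.outer(x, x)
    b = gamma * b + reward * x
    return A, b

d = 3
A = np.eye(d)                      # ridge-style initial Gram matrix
b = np.zeros(d)
rng = np.random.default_rng(0)
for _ in range(1000):
    x = rng.normal(size=d)
    A, b = forgetful_update(A, b, x, reward=1.0)
theta = np.linalg.solve(A, b)      # discounted least-squares estimate
n_eff = 1.0 / (1.0 - 0.99)         # effective sample size, about 100
```

Because old evidence is discounted every step, the Gram matrix stays bounded rather than growing with the stream length, which is the "bounded memory" property the paper attributes to geometric forgetting.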
Circularity Check
No significant circularity; derivation chain is self-contained
The paper presents ParetoBandit as an algorithmic extension of standard contextual bandits, introducing an online primal-dual budget pacer and geometric forgetting on sufficient statistics to handle non-stationary shifts without a known horizon. No equations, derivations, or self-citations are shown that reduce the claimed performance metrics (0.4% budget compliance, +0.071 quality lift) to fitted parameters or prior results by construction. The contributions are described as addressing explicit gaps in existing routers, with results tied to empirical evaluation on 1,824 prompts rather than tautological definitions or load-bearing self-references. This is a standard non-circular presentation of a new method.
Axiom & Free-Parameter Ledger
free parameters (1)
- geometric forgetting factor
axioms (1)
- domain assumption: Contextual bandits with linear or kernelized reward models can capture quality-cost tradeoffs across LLM portfolios.
Forward citations
Cited by 2 Pith papers
-
Latency-Quality Routing for Functionally Equivalent Tools in LLM Agents
LQM-ContextRoute routes tool calls by expected quality per service cycle using contextual bandits and LLM-as-judge feedback, yielding +2.18 pp F1, up to +18 pp accuracy, and +2.91-3.22 pp NDCG gains over SW-UCB on web...
-
Zero-Shot Confidence Estimation for Small LLMs: When Supervised Baselines Aren't Worth Training
Average token log-probability provides a zero-shot confidence signal for small LLMs that matches supervised baselines in-distribution and outperforms them out-of-distribution, with a new retrieval-conditional variant ...
Reference graph
Works this paper leans on
-
[1]
Program Synthesis with Large Language Models
Proxy-only benchmark on c5.4xlarge EKS nodes (16 vCPU). Kong p95 = 24 ms, Portkey p95 ≈ 70 ms; WireMock baseline p95 = 24 ms. Agrawal, S. and Goyal, N. Thompson sampling for contextual bandits with linear payoffs. Proceedings of the 30th International Conference on Machine Learning, 2013. Agrawal, S., Devanur, N. R., and Li, L. Linear contextual bandits ...
-
[2]
Pan, H., Tennenholtz, G., Mannor, S., Chi, C.-W., Brekelmans, R., Shah, P., and Tewari, A.
Pricing table lists gpt-4o-2024-08-06 at $2.50 per million input tokens, down from $5.00 (50% cut); output at $10.00, down from $15.00 (33% cut). Pan, H., Tennenholtz, G., Mannor, S., Chi, C.-W., Brekelmans, R., Shah, P., and Tewari, A. Jump starting bandits with LLM-generated prior knowledge. In Proceedings of the 2024 Conference on Empirical Methods in...
-
[3]
performs online tuning of α via a bandit-over-bandit scheme but remains single-objective and does not tune γ. Garivier and Moulines (Garivier & Moulines, 2011) derive γ as a function of the (unknown) number of breakpoints, which is not observable in production. Adaptation–horizon coupling. Rather than tuning n_eff and γ as independent statistical parameters...
-
[4]
Budget-paced Pareto AUC (stationary efficiency). On the validation split, we sweep log-spaced budget targets with the BudgetPacer active, build per-seed Pareto frontiers, and compute the area under each frontier. We then average AUC over 6×7 = 42 configurations and 20 random seeds
-
[5]
Catastrophic-failure Phase-2 reward (non-stationary resilience). We run a two-phase simulation on the validation split: Phase 1 (first half) uses normal rewards; Phase 2 (second half) degrades the failed arm's reward to 0.50 (∼46% below normal) while keeping its cost fixed. We use a stronger degradation than Experiment 3's 0.75 so that the selected config...
-
[6]
Knee-point selection enables forgetting that threshold methods miss. An ε-best-AUC threshold would select γ = 1.0 for Tabula Rasa because every γ < 1 falls outside the tolerance band. The knee-point criterion instead identifies where the marginal AUC cost of additional forgetting accelerates, and selects γ = 0.997 even on the steeper trade-off curve
-
[7]
Warmup priors and forgetting are synergistic. ParetoBandit dominates Tabula Rasa on both objectives (0.928 vs. 0.923 AUC; 0.7312 vs. 0.7287 Phase-2 reward). Forgetting (γ < 1) down-weights all past evidence, including the prior embedded in the sufficient statistics at initialisation. For Tabula Rasa the decayed prior is uninformative (λI, 0); for...
-
[8]
Both objectives are effectively invariant. Despite the three-step γ shift, AUC varies by less than 0.25% (0.9255–0.9277) and Phase-2 reward spans only 0.7933–0.7945, with no monotonic trend; the performance surface near the Pareto knee is flat, so T_adapt mainly affects the internal parameterisation (γ, n_eff) rather than external routing behaviour. A prac...
-
[9]
Well-calibrated: priors from the full training set (n=8,373)—the shipped default
-
[10]
Random-1680: a random subsample of 1,680 training prompts, matching the GSM8K-only count while preserving the full distribution. This is a sample-size control: any gap between Random-1680 and Well-calibrated isolates covariance-estimation noise from domain effects
-
[11]
MMLU-only: priors from the 1,855 MMLU prompts only. The prior preserves the correct model ranking (Gemini > Mistral ≫ Llama) but with knowledge-domain-specific magnitudes
-
[12]
All models score ∼0.86+ on math, so the prior encodes near-zero arm differentiation
GSM8K-only: priors from the 1,680 GSM8K (math) prompts only. All models score ∼0.86+ on math, so the prior encodes near-zero arm differentiation
-
[13]
Inverted: priors from the full training set with Llama and Gemini rewards swapped, so the prior believes the cheapest model is best and vice versa. These levels represent failure modes that arise naturally in deployment: (1–2) an operator who collected representative data; (3) an operator who collected data from a single knowledge domain; (4) an operator...
-
[14]
0.7-0.8 Sound overall; minor inefficiency or a trivial error that does not change the conclusion
Reasoning Quality (40%) - How sound is the reasoning? 0.9-1.0 Flawless; every step correct and clearly justified. 0.7-0.8 Sound overall; minor inefficiency or a trivial error that does not change the conclusion. 0.5-0.6 Partially correct; approach is reasonable but important steps are wrong or missing. 0.3-0.4 Weak; only fragments of correct logic. 0.0-0....
-
[15]
0.7-0.8 All major constraints met; one minor instruction partially missed
Instruction Following (30%) - Were all explicit and implicit constraints satisfied? 0.9-1.0 Every constraint followed precisely. 0.7-0.8 All major constraints met; one minor instruction partially missed. 0.5-0.6 Some important instructions missed or only partially addressed. 0.3-0.4 Multiple instructions ignored or misinterpreted. 0.0-0.2 Response largely...
-
[16]
Communication Quality (30%) - How clear, well-structured, and useful is the response? 0.9-1.0 Exceptionally clear, well-organized, appropriate detail. 0.7-0.8 Clear and competent; minor improvements possible. 0.5-0.6 Adequate but noticeably unclear, verbose, or poorly organized. 0.3-0.4 Hard to follow; significant clarity issues. 0.0-0.2 Unintelligible, u...
-
[17]
claims “microsecond” k-NN lookups but excludes the encoder forward pass. Infrastructure-layer proxies such as Portkey Gateway (Portkey AI, 2025) (<1 ms) and Kong AI Gateway (Acquaviva, 2025) (24 ms p95) provide a lower bound on non-routing overhead. Figure 14 places these systems and ParetoBandit on a common log-scale alongside typical LLM inference laten...