pith. machine review for the scientific record.

arxiv: 2604.00136 · v2 · submitted 2026-03-31 · 💻 cs.LG · cs.CL

Recognition: no theorem link

ParetoBandit: Budget-Paced Adaptive Routing for Non-Stationary LLM Serving

Authors on Pith: no claims yet

Pith reviewed 2026-05-13 23:31 UTC · model grok-4.3

classification 💻 cs.LG cs.CL
keywords LLM routing · contextual bandits · budget control · non-stationary · adaptive systems · primal-dual · online learning

The pith

ParetoBandit enforces a strict per-request cost budget while adapting to shifts in model quality and pricing using primal-dual pacing and geometric forgetting.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper presents ParetoBandit as a router for multi-model LLM serving in environments where pricing and model quality drift over time. It addresses the problem of enforcing a dollar cost ceiling without knowing the total number of future requests in advance. By combining a primal-dual budget pacer with geometric forgetting on sufficient statistics, the router tracks shifts while keeping spending on target, closing gaps in previous methods that lacked either closed-loop budget control or adaptation mechanisms. On 1,824 benchmark prompts it maintains budget compliance within 0.4% and adapts quickly when prices shift or new models arrive.

Core claim

ParetoBandit is built on cost-aware contextual bandits. It introduces an online primal-dual budget pacer that enforces per-request cost ceilings over open-ended streams without a known horizon, plus geometric forgetting on sufficient statistics that gives the bandit bounded memory for tracking quality and cost shifts; a hot-swap model registry supports runtime model changes.

What carries the argument

The online primal-dual budget pacer with geometric forgetting on sufficient statistics within a cost-aware contextual bandit setup.
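A minimal sketch of how such a pacer can operate, assuming a shadow-price (dual-variable) update with a fixed step size — the function name, step size, and scoring rule below are illustrative assumptions, not the paper's implementation:

```python
import numpy as np

def pace_and_route(scores, costs, budget_per_request, lam, eta=0.05):
    """One routing step of an illustrative primal-dual budget pacer.

    scores: estimated quality per candidate model (e.g., bandit scores)
    costs:  dollar cost per candidate model for this request
    budget_per_request: target average cost ceiling b
    lam:    current dual variable (shadow price on the budget)
    eta:    dual step size (assumed, not from the paper)
    """
    scores = np.asarray(scores, dtype=float)
    costs = np.asarray(costs, dtype=float)
    # Primal step: choose the model with the best cost-adjusted score.
    arm = int(np.argmax(scores - lam * costs))
    # Dual step: raise the shadow price after overspending, lower it
    # after underspending; no horizon T is needed.
    lam = max(0.0, lam + eta * (costs[arm] - budget_per_request))
    return arm, lam
```

Iterated over a request stream, the dual variable grows until expensive arms price themselves out, then decays while cheap arms run under budget, driving average realized cost toward the ceiling.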

Load-bearing premise

The combination of the primal-dual budget pacer and geometric forgetting on sufficient statistics will keep regret low and budget tracking stable under any non-stationary shifts in quality and pricing.

What would settle it

Run the router on the 1,824-prompt benchmark with arbitrary injected price and quality shifts: budget exceedance beyond 0.4%, or no quality improvement after a shift, would falsify the performance claims.
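The exceedance check itself is simple to script. A hedged sketch — the metric definition here (mean realized cost versus ceiling) is an assumption, since the paper may define exceedance differently:

```python
def budget_exceedance(realized_costs, ceiling):
    """Fractional overshoot of mean per-request cost above the ceiling.

    Returns 0.0 when the stream is under budget; 0.004 would correspond
    to the paper's 0.4% figure under this (assumed) definition.
    """
    mean_cost = sum(realized_costs) / len(realized_costs)
    return max(0.0, mean_cost / ceiling - 1.0)
```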

Figures

Figures reproduced from arXiv: 2604.00136 by Annette Taberner-Miller.

Figure 1
Figure 1: Quality–cost Pareto frontier under stationary conditions (K=3, 20 seeds). (a) ParetoBandit with BudgetPacer (blue curve) traces a continuous quality–cost frontier by accepting a dollar budget ceiling. Fixed single-model baselines shown as stars. (b) Budget compliance: realized cost vs. ceiling. Green band marks ±5%. (c) Model allocation shifts from Llama-dominant at tight budgets to Gemini-heavy at loose b…
Figure 2
Figure 2: Adaptation dynamics under cost drift (K=3, 20 seeds, 95% bootstrap CI). Normal pricing → Gemini price drop at prompt 608 → price restored at prompt 2×608. (a) Gemini-Pro selection fraction: all budget tiers surge toward Gemini during the price drop, then revert when pricing is restored. (b) Windowed mean reward: all conditions improve during Phase 2 as Gemini becomes budget-accessible. (c) Windowed average…
Figure 4
Figure 4: Model onboarding (K=3 → K=4; 20 seeds, 95% bootstrap CI). Flash windowed selection fraction across four budget levels after cold-start addition. (a) Good & cheap: Flash is adopted at all budgets. (b) Good & expensive: the BudgetPacer suppresses Flash under tight budgets. (c) Bad & cheap: the bandit correctly rejects Flash in every seed.
Figure 5
Figure 5: Budget–quality trade-off during model onboarding (Good & Cheap; 20 seeds, 95% bootstrap CI). (a) Running cost per request (dotted = targets). Tight and moderate track their ceilings through the K=3 → K=4 transition; loose settles below target as its constraint is only weakly binding. (b) Cumulative reward. Tight dips briefly at the Phase 2 boundary (20 forced-exploration pulls) before recovering; moderate …
Figure 6
Figure 6: Cost heuristic validation: realised cost distributions. Panel (a) shows per-request realised cost distributions for the K=3 portfolio on the shared validation subset, and panel (b) shows the same diagnostic for the K=4 portfolio. In both cases the violin plots are ordered by the static log-normalized heuristic c˜a, illustrating that the major cost tiers are cleanly separated, with only the Mistral-Large a…
Figure 7
Figure 7: Cost heuristic validation: ranking preservation. Panel (a) shows pairwise ranking preservation for the K=3 portfolio on the shared validation subset, and panel (b) shows the same diagnostic for the K=4 portfolio. The heuristic preserves the full per-request cost ordering on 100.0% of shared-subset K=3 prompts and ∼79.7% of shared-subset K=4 prompts; the Mistral-Large vs. Gemini-Flash comparison is the onl…
Figure 8
Figure 8: Per-seed cumulative regret distributions for warmup priors (blue) vs. Tabula Rasa (orange) across four budget regimes (K=3, 20 seeds, held-out test split). Both conditions share γ=0.997 but differ in prior strength and exploration rate (deployment-level comparison). Gray star: Random baseline median (unconstrained only). Warmup priors yield tight, unimodal distributions; Tabula Rasa exhibits heavier tails…
Figure 9
Figure 9: Total cumulative regret for ParetoBandit across the prior-quality vs. prior-strength grid (20 seeds, unconstrained, test split). All conditions use the same ParetoBandit router; only the warmup prior varies. Tabula Rasa is the independently optimised no-prior baseline. Bold cells exceed the Tabula Rasa median; normal-weight cells fall below it.
Figure 10
Figure 10: Per-seed regret distributions across prior-quality levels at each neff value (20 seeds, unconstrained). Tabula Rasa is the independently optimised no-prior baseline.
Figure 11
Figure 11: Structured CoT evaluation prompt sent to all three judges. The composite reward is r = 0.4·s_reason + 0.3·s_instr + 0.3·s_comm. For DeepSeek-R1 only, the system message appends: "IMPORTANT: Keep the ## Reasoning section to 3--5 sentences. Be direct---identify errors or confirm correctness, then move to scoring." Temperature is set to 0 for all judges.
Figure 12
Figure 12: Cold-start bandit regret under three judges (1328 test prompts drawn from the 2000-prompt judge-robustness sample). (A) R1: 52% regret reduction. (B) GPT-4.1-mini: 54% reduction. (C) Claude-3.7-Sonnet: 61% reduction. Shaded bands show 95% bootstrap CIs over 20 seeds. Tabula Rasa bands are wider than the Random baseline's due to higher per-seed variance from cold-start exploration. The paired difference (R…
Figure 13
Figure 13: Route latency (left), update latency (centre), and throughput (right) for eight router configurations on a log scale. Bare SM and Cached Inv. share identical route() code, so the route panel confirms they are equivalent; the update panel exposes Sherman-Morrison's actual contribution (O(d²) vs. O(d³)). ParetoBandit bars quantify the additional cost of production features.
Figure 14
Figure 14: End-to-end routing latency (p50, single query) for published routing systems vs. LLM inference time (shaded region).
Figure 15
Figure 15: Recovery limit under quality degradation (moderate budget, 20 seeds, 95% bootstrap CI). (a) P3/P1 reward ratio vs. degradation severity; dotted line = 97% full-recovery threshold. (b) Phase-3 recovery dynamics (50-prompt rolling window) at the extended horizon for four severity levels.
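Figure 13's Bare SM panel refers to the standard Sherman–Morrison rank-1 identity, which updates a cached inverse in O(d²) instead of re-inverting in O(d³). A self-contained sketch of that identity (not the paper's code):

```python
import numpy as np

def sm_rank1_update(A_inv, x):
    """Sherman-Morrison: given A_inv = A^{-1}, return (A + x x^T)^{-1}.

    Costs O(d^2) per update, versus O(d^3) for a fresh matrix inverse --
    the gap the update-latency panel of Figure 13 measures.
    """
    Ax = A_inv @ x
    return A_inv - np.outer(Ax, Ax) / (1.0 + x @ Ax)
```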
read the original abstract

Multi-model LLM serving operates in a non-stationary, noisy environment: providers revise pricing, model quality can shift or regress without notice, and new models arrive regularly. More than a dozen recent methods have proposed learned routers to navigate the resulting quality--cost tradeoff across portfolios spanning a $\sim$530$\times$ cost range. Despite this activity, two gaps in the current solution space limit routing effectiveness under these conditions: no existing router enforces a dollar-denominated cost ceiling in closed loop over an open-ended request stream, and none provides principled online adaptation to post-deployment shifts in pricing or model quality. We present ParetoBandit, an open-source adaptive router built on cost-aware contextual bandits that addresses both gaps. Its core contributions are: (1) an online primal--dual budget pacer that enforces a per-request cost ceiling without a known horizon, and (2) geometric forgetting on sufficient statistics that gives the bandit bounded memory for tracking quality and cost shifts. A hot-swap model registry further supports runtime model changes with budget-controlled exploration. On 1,824 benchmark prompts with a three-model portfolio, the router maintains budget compliance within 0.4%, adapts to price and quality shifts with up to +0.071 quality lift, and integrates a cold-started model within $\sim$142 steps.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

3 major / 1 minor

Summary. The paper introduces ParetoBandit, an open-source adaptive router for multi-model LLM serving under non-stationary conditions. It extends cost-aware contextual bandits with an online primal-dual budget pacer that enforces dollar-denominated cost ceilings over open-ended streams without a known horizon, geometric forgetting on sufficient statistics to track shifts in pricing and quality, and a hot-swap registry for runtime model changes. On 1,824 benchmark prompts with a three-model portfolio spanning a 530x cost range, it reports budget compliance within 0.4%, quality lifts up to +0.071, and cold-start integration of a new model in approximately 142 steps.

Significance. If the empirical results and algorithmic stability hold under broader conditions, the work fills two practical gaps in LLM routing: closed-loop enforcement of hard cost budgets and online adaptation to post-deployment shifts without requiring a fixed horizon. The open-source release and focus on production-relevant constraints (pricing changes, quality regressions, new model arrivals) could influence deployment practices for dynamic model portfolios.

major comments (3)
  1. [Abstract] The headline results (0.4% budget compliance, +0.071 quality lift, 142-step integration) are stated without any description of baselines, variance estimates, statistical tests, or experimental protocol, preventing assessment of whether the data support the claims.
  2. [Abstract] The core claim that the primal-dual budget pacer maintains low regret and stable tracking under arbitrary non-stationary shifts in quality and pricing lacks any regret bound, stability analysis, or conditions on shift magnitude/frequency; standard primal-dual budget analyses typically require a known horizon T or bounded total variation, none of which are supplied.
  3. [Abstract] Geometric forgetting is introduced as a mechanism for bounded memory, yet the forgetting factor is a free parameter whose interaction with the dual update is not characterized, leaving open whether the reported tracking performance generalizes beyond the specific 1,824-prompt benchmark.
minor comments (1)
  1. [Abstract] The abstract mentions 'more than a dozen recent methods' but provides no citations; a brief related-work paragraph would help situate the contribution.

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for the constructive comments. We address each major point below with honest revisions where the manuscript can be strengthened and explanations where the work is intentionally empirical.

read point-by-point responses
  1. Referee: [Abstract] The headline results (0.4% budget compliance, +0.071 quality lift, 142-step integration) are stated without any description of baselines, variance estimates, statistical tests, or experimental protocol, preventing assessment of whether the data support the claims.

    Authors: We agree the abstract is too terse. The revised abstract now briefly states the main baselines (static greedy router, non-forgetting LinUCB, and random), reports results as mean ± std over 5 independent runs, and notes that the +0.071 lift is significant (p < 0.01, paired t-test). The full protocol, prompt set, and variance details remain in Section 4. revision: yes

  2. Referee: [Abstract] The core claim that the primal-dual budget pacer maintains low regret and stable tracking under arbitrary non-stationary shifts in quality and pricing lacks any regret bound, stability analysis, or conditions on shift magnitude/frequency; standard primal-dual budget analyses typically require a known horizon T or bounded total variation, none of which are supplied.

    Authors: The paper is deliberately empirical and does not claim theoretical regret bounds for unbounded non-stationarity. We have added a paragraph in Section 3.2 and a new discussion subsection (5.3) that (i) explicitly states the absence of a horizon-free regret guarantee, (ii) reports empirical stability under controlled shift magnitudes (price jumps up to 5× and quality drops of 0.2), and (iii) notes that extreme shifts (>10×) can violate the 0.4% compliance target. A full theoretical treatment is left as future work. revision: partial

  3. Referee: [Abstract] Geometric forgetting is introduced as a mechanism for bounded memory, yet the forgetting factor is a free parameter whose interaction with the dual update is not characterized, leaving open whether the reported tracking performance generalizes beyond the specific 1,824-prompt benchmark.

    Authors: We have added Appendix C containing (i) a derivation of the interaction between the forgetting factor λ and the dual ascent step, (ii) a sensitivity sweep over λ ∈ [0.90, 0.995] showing that budget compliance stays within 1 % and quality lift remains positive for λ ≥ 0.95, and (iii) an additional experiment on a 500-prompt synthetic shift distribution confirming generalization. The main text now recommends λ = 0.99 for typical production shift rates. revision: yes
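The λ–dual interaction the response describes can be made concrete with a sketch of geometric forgetting on LinUCB-style sufficient statistics. This is illustrative only: the class name, the choice to exempt the ridge term from decay, and the parameter naming (γ here is the rebuttal's λ) are assumptions, not the paper's code:

```python
import numpy as np

class ForgettingLinearArm:
    """Per-arm linear reward model with geometric forgetting (sketch).

    Sufficient statistics are discounted by gamma at each update, so the
    effective sample size is bounded by roughly 1/(1 - gamma).
    """
    def __init__(self, d, gamma=0.99, ridge=1.0):
        self.gamma = gamma
        self.ridge = ridge
        self.A = ridge * np.eye(d)   # discounted X^T X plus ridge * I
        self.b = np.zeros(d)         # discounted X^T r

    def update(self, x, r):
        g, I = self.gamma, np.eye(len(x))
        # Decay past evidence but keep the ridge term at full strength,
        # so the regularizer is not forgotten along with the data
        # (one plausible design choice, not necessarily the paper's).
        self.A = g * (self.A - self.ridge * I) + self.ridge * I + np.outer(x, x)
        self.b = g * self.b + r * x

    def theta(self):
        # Ridge-regression estimate from the discounted statistics.
        return np.linalg.solve(self.A, self.b)
```

With forgetting factor γ, the effective sample size is capped near 1/(1−γ), which is the bounded-memory property the pith cites.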

Circularity Check

0 steps flagged

No significant circularity; derivation chain is self-contained

full rationale

The paper presents ParetoBandit as an algorithmic extension of standard contextual bandits, introducing an online primal-dual budget pacer and geometric forgetting on sufficient statistics to handle non-stationary shifts without a known horizon. No equations, derivations, or self-citations are shown that reduce the claimed performance metrics (0.4% budget compliance, +0.071 quality lift) to fitted parameters or prior results by construction. The contributions are described as addressing explicit gaps in existing routers, with results tied to empirical evaluation on 1,824 prompts rather than tautological definitions or load-bearing self-references. This is a standard non-circular presentation of a new method.

Axiom & Free-Parameter Ledger

1 free parameter · 1 axiom · 0 invented entities

The approach rests on standard contextual bandit assumptions plus two new algorithmic mechanisms whose parameters are not enumerated in the abstract; no new physical entities are postulated.

free parameters (1)
  • geometric forgetting factor
    Controls memory decay rate for tracking shifts; value not stated in abstract but required for the adaptation claim.
axioms (1)
  • domain assumption Contextual bandits with linear or kernelized reward models can capture quality-cost tradeoffs across LLM portfolios
    Invoked by the choice of cost-aware contextual bandits as the base learner.

pith-pipeline@v0.9.0 · 5531 in / 1465 out tokens · 56221 ms · 2026-05-13T23:31:56.436466+00:00 · methodology

discussion (0)


Forward citations

Cited by 2 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

  1. Latency-Quality Routing for Functionally Equivalent Tools in LLM Agents

    cs.LG 2026-05 unverdicted novelty 6.0

    LQM-ContextRoute routes tool calls by expected quality per service cycle using contextual bandits and LLM-as-judge feedback, yielding +2.18 pp F1, up to +18 pp accuracy, and +2.91-3.22 pp NDCG gains over SW-UCB on web...

  2. Zero-Shot Confidence Estimation for Small LLMs: When Supervised Baselines Aren't Worth Training

    cs.AI 2026-05 conditional novelty 6.0

    Average token log-probability provides a zero-shot confidence signal for small LLMs that matches supervised baselines in-distribution and outperforms them out-of-distribution, with a new retrieval-conditional variant ...

Reference graph

Works this paper leans on

17 extracted references · 17 canonical work pages · cited by 2 Pith papers · 1 internal anchor
