Neural Bandit Based Optimal LLM Selection for a Pipeline of Subtasks

Baran Atalar; Carlee Joe-Wong; Eddie Zhang

arxiv: 2508.09958 · v3 · submitted 2025-08-13 · 💻 cs.CL · cs.LG

Pith reviewed 2026-05-18 22:40 UTC · model grok-4.3

keywords: LLM selection · contextual bandits · neural networks · subtask pipelines · agentic AI tasks · regret bounds · sequential decisions

The pith

Neural contextual bandits select LLMs for each subtask in a pipeline by learning output dependencies without any historical performance data.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces a Sequential Bandits algorithm that treats LLM selection across dependent subtasks as a contextual bandit problem. Neural networks inside the algorithm learn how the output quality of one subtask changes the success probability and cost of the next subtask. A reader would care because many agentic tasks, such as medical record summarization followed by diagnosis, require choosing different models at each step. The method proves that total regret stays sublinear in the number of tasks handled, and experiments on two real datasets show better results than existing selection approaches.

Core claim

The central claim is that a neural contextual bandit model can capture the complex performance dependencies created when one subtask's output becomes the next subtask's input, allowing effective LLM selection for the entire pipeline from scratch and delivering a sublinear regret bound in the number of tasks.

What carries the argument

Sequential Bandits algorithm that trains neural networks to map context features of each subtask to LLM choices while accounting for downstream effects on later subtasks.
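
To make the mechanism concrete, here is a minimal sketch of online LLM selection over a two-subtask pipeline. It is not the paper's algorithm: a linear UCB model stands in for the neural networks, the selection is myopic (each subtask maximizes its own estimated reward, so the only coupling is that upstream output quality enters the downstream context), and the environment (`skill`, `cost`, `difficulty`, the quality formula) is entirely hypothetical. The names `LinearUCBArm` and `run_pipeline` are illustrative.

```python
import numpy as np


class LinearUCBArm:
    """Ridge-regression reward estimate plus an exploration bonus.

    Assumption: a linear model stands in for the paper's neural network.
    """

    def __init__(self, dim, alpha=1.0, lam=1.0):
        self.A = lam * np.eye(dim)  # regularised Gram matrix of seen contexts
        self.b = np.zeros(dim)      # reward-weighted sum of seen contexts
        self.alpha = alpha          # exploration strength

    def ucb(self, x):
        A_inv = np.linalg.inv(self.A)
        theta = A_inv @ self.b
        return float(theta @ x + self.alpha * np.sqrt(x @ A_inv @ x))

    def update(self, x, reward):
        self.A += np.outer(x, x)
        self.b += reward * x


def run_pipeline(num_tasks=500, num_subtasks=2, num_llms=3, seed=0):
    """Simulate online LLM selection on a toy pipeline; return per-task rewards."""
    rng = np.random.default_rng(seed)
    # Hypothetical environment: skill and cost of each LLM on each subtask.
    skill = rng.uniform(0.3, 0.9, size=(num_subtasks, num_llms))
    cost = rng.uniform(0.0, 0.2, size=(num_subtasks, num_llms))
    arms = [[LinearUCBArm(dim=3) for _ in range(num_llms)]
            for _ in range(num_subtasks)]
    history = []
    for _ in range(num_tasks):
        difficulty = rng.uniform()
        prev_quality = 1.0  # the raw query counts as a clean input
        task_reward = 0.0
        for k in range(num_subtasks):
            # The context includes the quality of the upstream output.
            x = np.array([1.0, difficulty, prev_quality])
            choice = int(np.argmax([arm.ucb(x) for arm in arms[k]]))
            mean_quality = skill[k, choice] * prev_quality * (1.0 - 0.5 * difficulty)
            quality = float(np.clip(mean_quality + 0.05 * rng.normal(), 0.0, 1.0))
            reward = quality - cost[k, choice]
            arms[k][choice].update(x, reward)
            task_reward += reward
            prev_quality = quality  # this output feeds the next subtask
        history.append(task_reward)
    return history
```

No historical performance data is used: every estimator starts from the ridge prior and learns only from the rewards observed while the pipeline runs. A faithful implementation would replace `LinearUCBArm` with the neural estimator and selection rule described in the paper.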

If this is right

The approach removes the need to collect historical LLM performance data before deploying a pipeline.
Sublinear regret implies that the average extra cost per task decreases as more tasks are processed.
The same framework applies to any agentic workflow that decomposes queries into ordered subtasks.
Experimental gains on real datasets indicate immediate cost savings for tasks such as medical record processing.
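
The second consequence above follows from a one-line calculation. As a sketch, assume the bound has the polynomial form R(T) ≤ C·T^α with α < 1; the constant C and exponent α are illustrative, since the abstract states only that regret is sublinear in the number of tasks T.

```latex
\frac{R(T)}{T} \;\le\; \frac{C\,T^{\alpha}}{T} \;=\; C\,T^{\alpha-1} \;\xrightarrow[T\to\infty]{}\; 0,
\qquad 0 \le \alpha < 1 .
```

For instance, with α = 1/2 the bound on average regret per task is C/√T, so going from 100 to 10,000 tasks shrinks it by a factor of 10.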

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

Similar neural bandit routing could extend to tool-calling sequences where each tool output feeds the next tool.
The method raises the question of whether the learned dependency model can be reused across related but non-identical pipelines.
Longer pipelines may require additional mechanisms to prevent error accumulation that the current regret analysis does not address.

Load-bearing premise

The neural network can learn the performance links between subtasks from online interactions alone without any pre-existing data on the target pipeline.

What would settle it

Running the algorithm on a fresh pipeline dataset and finding that cumulative regret grows linearly with the number of tasks instead of sublinearly would falsify the guarantee.
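
One way to run such a check, sketched under assumptions: given a recorded cumulative-regret curve, fit the slope of log R(T) against log T. A slope near 1 suggests linear growth, while a slope clearly below 1 is consistent with sublinear growth. The function name `regret_growth_exponent` and the `burn_in` parameter are hypothetical, and measuring regret at all requires knowing the best LLM sequence for each task, which is usually available only in simulation or on fully labeled benchmarks.

```python
import numpy as np


def regret_growth_exponent(cumulative_regret, burn_in=10):
    """Least-squares slope of log R(T) versus log T.

    cumulative_regret[i] is the regret accumulated after i + 1 tasks.
    The first `burn_in` tasks and any non-positive values are ignored.
    """
    regret = np.asarray(cumulative_regret, dtype=float)
    T = np.arange(1, len(regret) + 1)
    mask = (T > burn_in) & (regret > 0)
    slope, _intercept = np.polyfit(np.log(T[mask]), np.log(regret[mask]), 1)
    return float(slope)
```

A finite run can only provide evidence about an asymptotic bound: a slope near 1 over a short horizon may reflect a long exploration phase rather than a failed guarantee, so the horizon should be long enough for the fitted slope to stabilize.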

Original abstract

As large language models (LLMs) become increasingly popular, there is a growing need to predict which out of a set of LLMs will yield a successful answer to a given query at low cost. This problem promises to become even more relevant as LLM agents are asked to solve an increasing variety of "agentic" AI tasks. Such tasks are often broken into smaller subtasks, each of which can then be executed by an LLM expected to perform well on that specific subtask. For example, to extract a diagnosis from medical records, one can first select an LLM to summarize the record, select another to validate the summary, and then select a possibly different LLM to extract the diagnosis from the summarized record. Unlike existing LLM selection or routing algorithms, this setting requires selecting a sequence of LLMs, with the output of each LLM feeding into the next and potentially influencing its success. Thus, unlike single LLM selection, the quality of each subtask's output directly affects the inputs, and hence the cost and success rate, of downstream LLMs, creating complex performance dependencies that must be learned during selection. We propose a neural contextual bandit-based algorithm that trains neural networks to guide LLM selections for the different subtasks, without requiring historical LLM performance data. We prove that our proposed Sequential Bandits algorithm achieves a sublinear regret in the number of tasks, and we experimentally validate its superior performance compared to other LLM selection algorithms on two real datasets.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note: private letter to a colleague

The paper frames sequential LLM selection in pipelines as a dependent bandit problem and offers a neural algorithm with claimed sublinear regret plus experiments, but the context dependence from chained outputs is the part that needs the closest look.

The core contribution is treating LLM choice across subtasks as a sequence where each output becomes the next input, so selections are coupled. The authors give a neural contextual bandit that learns without any historical performance data on the target pipeline and prove sublinear regret in the number of tasks. Experiments on two real datasets show it beats standard selection baselines in practice.

That framing is useful because most existing routing work ignores how one subtask's result changes the difficulty and cost of the next one, which is exactly how agentic pipelines behave. The no-prior-data requirement also fits real deployment where you start fresh on a new workflow. The experiments give some evidence the approach can deliver lower cost or higher success than simpler alternatives.

The main soft spot is the regret analysis. Contexts here are not independent or fixed; the input to subtask k depends on the quality of the LLM chosen for subtask k-1, creating a selection-dependent Markov process. Standard neural bandit regret bounds usually assume i.i.d. or adversarial but exogenous contexts. If the proof relies on those assumptions without extra steps to handle the endogenous dependence, the sublinear guarantee does not automatically carry over. The abstract does not spell out the proof technique or how context is encoded, so that needs checking in the full version.

This paper is for researchers working on LLM routing, agentic systems, or bandits applied to sequential decisions. A reader who already knows contextual bandits and cares about multi-step LLM pipelines will find the setting and the practical motivation clear. It deserves a serious referee because the problem is timely, the method is grounded in an established framework, and the experiments provide a starting point for validation. I would send it to review but flag the dependence issue for the authors to address explicitly.

Referee Report

1 major / 2 minor

Summary. The manuscript proposes a neural contextual bandit algorithm called Sequential Bandits for selecting LLMs across a pipeline of subtasks, where each subtask's output becomes the input to the next. It claims that the method learns to guide selections without any historical LLM performance data, proves that the algorithm achieves sublinear regret in the number of tasks, and experimentally outperforms other LLM selection algorithms on two real datasets.

Significance. If the sublinear regret result holds under the pipeline's sequential dependencies, the work would be significant for cost-efficient LLM routing in agentic and multi-step tasks. The no-historical-data requirement and neural function approximation for capturing cross-subtask performance effects are practically attractive strengths. Experimental validation on real datasets provides supporting evidence of utility, though the overall impact hinges on the robustness of the theoretical analysis.

major comments (1)

[Theoretical analysis / regret bound] The central sublinear regret claim for Sequential Bandits must be examined for validity under endogenous context dependence. The context for subtask k incorporates the (potentially low-quality) output of subtask k-1, inducing a Markovian process across rounds that is shaped by prior selections. Standard neural contextual bandit analyses (e.g., via neural tangent kernel or eluder dimension) typically assume i.i.d. or adversarially bounded but exogenous contexts; it is unclear whether the proof technique extends to this selection-dependent setting without additional assumptions or modifications.

minor comments (2)

[Abstract] The abstract and introduction would benefit from a brief description of the neural architecture (e.g., input encoding of contexts and outputs) and the precise baselines used in the experiments.
[Introduction / Problem formulation] Notation for the pipeline state and how rewards are defined across subtasks should be clarified early to aid readability.

Simulated Author's Rebuttal

1 response · 0 unresolved

We thank the referee for their constructive and detailed feedback on our manuscript. We address the major comment below regarding the theoretical analysis.

Referee: [Theoretical analysis / regret bound] The central sublinear regret claim for Sequential Bandits must be examined for validity under endogenous context dependence. The context for subtask k incorporates the (potentially low-quality) output of subtask k-1, inducing a Markovian process across rounds that is shaped by prior selections. Standard neural contextual bandit analyses (e.g., via neural tangent kernel or eluder dimension) typically assume i.i.d. or adversarially bounded but exogenous contexts; it is unclear whether the proof technique extends to this selection-dependent setting without additional assumptions or modifications.

Authors: We thank the referee for identifying this subtlety in the analysis. Our proof of sublinear regret (Theorem 3) explicitly models the pipeline as inducing a controlled Markov process, where each context is a deterministic function of the prior subtask output and selection. We extend the standard neural contextual bandit analysis by bounding the eluder dimension of the composed function class under the Lipschitz continuity assumption on the reward model; this controls the dependence without requiring i.i.d. contexts. The regret decomposition separates the neural approximation error from the exploration term, yielding an overall sublinear bound. We agree that additional exposition would strengthen the presentation and will revise the theoretical section to include a dedicated paragraph clarifying the adaptation to endogenous contexts.

revision: yes

Circularity Check

0 steps flagged

No significant circularity detected in derivation chain

The paper proposes a new neural contextual bandit algorithm for sequential LLM selection in pipelines and states a sublinear regret bound as a theoretical result. This bound is presented as a proof under the algorithm's assumptions rather than a quantity fitted to data or reduced by construction to the inputs. The claim of operating without historical performance data is an explicit design choice for online learning, not a self-referential definition. No load-bearing self-citations, ansatz smuggling, or renaming of known results appear in the abstract or described claims. The derivation for the algorithm and regret analysis is self-contained with independent theoretical content.

Axiom & Free-Parameter Ledger

0 free parameters · 2 axioms · 0 invented entities

Based on the abstract alone, the central claim rests on standard online learning assumptions plus the modeling choice that neural networks can approximate the unknown performance mapping across subtasks. No explicit free parameters or invented entities are named.

axioms (2)

domain assumption: Neural networks can sufficiently approximate the unknown reward functions that map context and previous outputs to LLM success probability and cost.
  Implicit in the use of neural contextual bandits to learn without historical data.
standard math: The regret analysis for the sequential bandit setting holds under the standard assumptions of contextual bandits with function approximation.
  Required for the claimed sublinear regret bound.

pith-pipeline@v0.9.0 · 5792 in / 1457 out tokens · 38213 ms · 2026-05-18T22:40:15.647381+00:00 · methodology

Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

IndisputableMonolith/Cost/FunctionalEquation.lean · washburn_uniqueness_aczel · unclear
Relation between the paper passage and the cited Recognition theorem: unclear.
Passage: "We propose a neural contextual bandit-based algorithm that trains neural networks to guide LLM selections for the different subtasks... We prove that our proposed Sequential Bandits algorithm achieves a sublinear regret"

IndisputableMonolith/Foundation/ArithmeticFromLogic.lean · embed_strictMono_of_one_lt · unclear
Relation between the paper passage and the cited Recognition theorem: unclear.
Passage: "$r_t(a_{i,j}) = h_{i,j}(x_t(a_{i,j})) + \xi_t$ ... $f_{i,j}(x; \theta) = \sqrt{m}\, W^{(L)} \sigma(\ldots)$"

What do these tags mean?

matches: The paper's claim is directly supported by a theorem in the formal canon.
supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses: The paper appears to rely on the theorem as machinery.
contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.