ZEBRA: Zero-shot Budgeted Resource Allocation for LLM Orchestration

Inbal Talgam-Cohen; May Hamri

arxiv: 2605.20485 · v1 · pith:I5S43C55new · submitted 2026-05-19 · 💻 cs.LG

Pith reviewed 2026-05-21 06:57 UTC · model grok-4.3

classification 💻 cs.LG

keywords zero-shot resource allocation · LLM orchestration · multi-agent pipelines · budgeted optimization · utility curve estimation · water-filling algorithm · knapsack formulation

The pith

ZEBRA reduces multi-phase LLM budget allocation to a nonlinear knapsack problem solved by water-filling after zero-shot utility curve estimation by an LLM controller.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The authors present ZEBRA, a framework that lets an LLM orchestrate budget spending across the phases of a larger agent pipeline without any task-specific training. It works by asking the LLM to sketch how much quality each phase would deliver at different spend levels, then feeds those curves into a continuous optimization routine that finds the best division of a fixed total budget. A reader would care because many real-world agent deployments must stay inside monetary limits, and simply letting the LLM pick the split wastes performance. The method is shown to keep more of the unconstrained quality than direct LLM decisions on both a large coding benchmark and a multi-hop QA pipeline, and it automatically chooses different splits for the two task types.

Core claim

ZEBRA frames budgeted allocation across pipeline phases as a continuous nonlinear knapsack problem. The LLM controller is prompted zero-shot to produce per-phase utility curves, after which a water-filling search over the Lagrange multiplier yields the per-phase spend levels that are optimal for those estimated curves. Additive and multiplicative quality combinations are handled by the same procedure. On the APPS benchmark at half the unconstrained budget, the approach retains 94.4 percent of full quality compared with 88.1 percent for direct LLM allocation, and the difference is statistically significant. The same procedure improves a three-phase HotpotQA pipeline by 14.3 percentage points over direct allocation and produces a near-balanced split there, in contrast to the refinement-skewed split on APPS.

What carries the argument

Zero-shot estimation of per-phase utility curves by an LLM controller followed by water-filling search on the Lagrange multiplier to solve the continuous nonlinear knapsack problem.
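
The search on the Lagrange multiplier can be sketched in a few lines. This is a minimal illustration for the additive case only, assuming the saturating-exponential curve family a(1 − e^(−bx)) quoted in the paper's Figure 2 caption. The function name `water_fill`, the geometric bisection, and the example parameter values are hypothetical choices made for this sketch, not the authors' implementation.

```python
import math

def water_fill(curves, budget, iters=200):
    """Split `budget` across phases whose utility is a * (1 - exp(-b * x)).

    curves: list of (a, b) pairs with a > 0 and b > 0 (hypothetical inputs).
    Returns one non-negative spend per phase, summing to about `budget`.
    """
    def spend_at(lam):
        # Solve a * b * exp(-b * x) = lam for x; phases whose initial
        # marginal utility a * b is below lam receive nothing.
        return [max(0.0, math.log(a * b / lam) / b) for a, b in curves]

    hi = max(a * b for a, b in curves)  # at this multiplier total spend is zero
    lo = hi
    while sum(spend_at(lo)) < budget:   # lower the multiplier until we overspend
        lo /= 2.0
    for _ in range(iters):              # bisect on a log scale
        mid = math.sqrt(lo * hi)
        if sum(spend_at(mid)) > budget:
            lo = mid
        else:
            hi = mid
    return spend_at(hi)

# Illustrative (a, b) values for plan, decompose, implement, refine.
example_curves = [(0.9, 100.0), (0.8, 60.0), (0.95, 40.0), (1.0, 20.0)]
example_split = water_fill(example_curves, budget=0.10)
```

Total spend falls monotonically as the multiplier rises, which is why a one-dimensional bisection is enough regardless of the number of phases.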

If this is right

At a budget equal to half the unconstrained spend, 94.4 percent of unconstrained quality is recovered on APPS versus 88.1 percent for LLM-direct.
The advantage transfers to HotpotQA, yielding a 14.3 percentage point gain over direct allocation.
Computed allocations adapt to pipeline structure, skewing toward refinement on coding tasks but remaining balanced on QA tasks.
The allocation remains effective even when the estimated utility curves contain estimation noise.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the authors make directly.

Adding a lightweight optimization step at inference time can measurably improve the economic efficiency of autonomous LLM agents.
The curve-estimation-plus-search pattern may generalize to other constraints such as total response time or token budget.
Dynamic reallocation during execution could become feasible if the utility estimates can be updated on the fly.

Load-bearing premise

Zero-shot prompting produces utility curves accurate enough for the water-filling solver to return a near-optimal budget split.

What would settle it

A test that collects ground-truth quality-versus-cost data for each phase on a validation set and demonstrates that the LLM estimates are inaccurate enough to make the resulting allocation inferior to a uniform budget split.
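
One way such a check could be scored is sketched below. It assumes additive aggregation and a small table of measured (cost, quality) points per phase, interpolated linearly. The helper names (`interp`, `total_quality`, `beats_uniform`) and the numbers in `measured` are hypothetical and do not come from the paper.

```python
from bisect import bisect_right

def interp(points, x):
    """Piecewise-linear interpolation through sorted (cost, quality) points."""
    costs = [c for c, _ in points]
    if x <= costs[0]:
        return points[0][1]
    if x >= costs[-1]:
        return points[-1][1]
    j = bisect_right(costs, x)
    (c0, q0), (c1, q1) = points[j - 1], points[j]
    return q0 + (q1 - q0) * (x - c0) / (c1 - c0)

def total_quality(measured, split):
    """Additive quality of a budget split under the measured curves."""
    return sum(interp(measured[phase], split[phase]) for phase in measured)

def beats_uniform(measured, split):
    """True if `split` scores at least as well as an equal split of the same budget."""
    budget = sum(split.values())
    uniform = {phase: budget / len(measured) for phase in measured}
    return total_quality(measured, split) >= total_quality(measured, uniform)

# Hypothetical ground-truth measurements for a two-phase pipeline.
measured = {
    "plan": [(0.0, 0.0), (0.01, 0.5), (0.05, 0.6)],
    "refine": [(0.0, 0.0), (0.01, 0.1), (0.05, 0.5)],
}
good_split = {"plan": 0.01, "refine": 0.05}
bad_split = {"plan": 0.05, "refine": 0.01}
```

If allocations derived from the LLM-estimated curves fail this comparison on held-out tasks more often than not, the load-bearing premise would be in trouble.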

Figures

Figures reproduced from arXiv: 2605.20485 by Inbal Talgam-Cohen, May Hamri.

**Figure 1.** ZEBRA overview. Given an input task and a fixed total budget (e.g., $0.10), ZEBRA adds an allocation agent that prompts an LLM to estimate a performance–budget curve for each workflow phase (Plan, Decompose, Implement, Refine). ZEBRA solves a continuous nonlinear knapsack via water-filling to allocate the global budget across phases before execution. The shared multi-phase pipeline then runs under these pe…

**Figure 2.** Per-phase utility curves and water-filling. Left: the saturating exponential $f_i(x) = a_i(1 - e^{-b_i x})$ for the four pipeline phases, with the quality ceiling $a_i$ shown as a dotted line. Phases differ in how quickly they saturate (the $b_i$ parameter): plan reaches its ceiling fastest, refine slowest. Right: the corresponding marginal-utility curves $f_i'(x) = a_i b_i e^{-b_i x}$. ZEBRA’s knapsack solution is precisely …
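
For readers who want the step from the marginal-utility curves to an allocation spelled out, the following is the standard KKT argument for a separable concave program, written for the additive case with the curve family from the caption. The symbol $B$ for the total budget and $\lambda^\star$ for the optimal multiplier are notation assumed here, not taken from the paper.

```latex
\begin{aligned}
&\max_{x_1,\dots,x_n \ge 0} \; \sum_{i=1}^{n} f_i(x_i)
  \quad \text{s.t.} \quad \sum_{i=1}^{n} x_i \le B,
  \qquad f_i(x) = a_i\bigl(1 - e^{-b_i x}\bigr), \\[4pt]
&f_i'(x_i) = a_i b_i e^{-b_i x_i} = \lambda \ \text{ whenever } x_i > 0,
  \qquad x_i = 0 \ \text{ whenever } a_i b_i \le \lambda, \\[4pt]
&x_i(\lambda) = \max\!\Bigl(0,\ \tfrac{1}{b_i}\ln\tfrac{a_i b_i}{\lambda}\Bigr),
  \qquad \sum_{i=1}^{n} x_i(\lambda^\star) = B .
\end{aligned}
```

Because each $x_i(\lambda)$ is non-increasing in $\lambda$, the last equation has a unique solution for any $B > 0$ and can be found by bisection.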

**Figure 3.** Refine-phase calibration: controller-predicted curve shape vs. empirical Δscore from refine. Empirical points are mean Δscore = final − implement within ten equal-count quantile bins of realized refine spend, with standard-error bars; runs are capped at 30 per task before binning. The dashed curve is the controller’s predicted shape $a(1 - e^{-bx})$ at the tier-average cost-adjusted $b$, normalized so its platea…
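
The binning described in this caption can be reproduced along the following lines. This is a sketch under assumptions: the function name `quantile_bin_means` is hypothetical, the per-task cap of 30 runs is assumed to have been applied upstream, and the synthetic data exists only to exercise the function.

```python
import math
import statistics

def quantile_bin_means(spend, delta, n_bins=10):
    """Mean spend, mean delta and standard error within equal-count bins of spend."""
    pairs = sorted(zip(spend, delta))
    n = len(pairs)
    rows = []
    for k in range(n_bins):
        chunk = pairs[k * n // n_bins:(k + 1) * n // n_bins]
        if len(chunk) < 2:
            continue  # a standard error needs at least two points
        xs = [p[0] for p in chunk]
        ys = [p[1] for p in chunk]
        se = statistics.stdev(ys) / math.sqrt(len(ys))
        rows.append((statistics.mean(xs), statistics.mean(ys), se))
    return rows

# Synthetic, noise-free data: delta follows a saturating curve in spend.
demo_spend = [i / 100 for i in range(100)]
demo_delta = [0.5 * (1 - math.exp(-5 * s)) for s in demo_spend]
demo_rows = quantile_bin_means(demo_spend, demo_delta)
```

Plotting the second column against the first, with the third as error bars, gives the empirical points that a predicted curve shape would be compared against.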

**Figure 4.** Per-task scores: best ZEBRA variant vs LLM-direct. Each dot is a task; y-axis is the best ZEBRA variant’s per-task mean score, x-axis is LLM-direct’s. Dashed line is y = x. Points above the diagonal are tasks where ZEBRA wins. Easy tasks (blue) cluster near the diagonal in the upper right; the per-task gap grows as tasks get harder, with a visible swarm of red points well above the diagonal at low LLM scor…

**Figure 5.** NB retention vs budget tightness. Fraction of unconstrained quality recovered by each strategy as the budget multiplier α tightens. The ZEBRA-vs-LLM gap is essentially zero at α = 0.8 on easy tasks and grows as either the budget tightens (easy: α = 0.5 → 0.3) or the tier hardens (α = 0.5, easy → medium → hard).

… than at α = 0.5 on the same easy tier (30.5%, 45.1%); LLM-direct’s refine share is essentially u…

**Figure 6.** Allocation distributions at α = 0.5. Stacked bars show the mean fraction of the total budget spent on each phase, per strategy, on easy (left) versus medium+hard (right) tasks. ZEBRA shifts spend from implement (easy) to refine (medium+hard); LLM-direct uses a near-identical split in both regimes.

… downstream phases than late phases – we introduce dependency weights $w_i > 0$ and maximize: $\max_x \prod_{i=1}^{n}$ [PITH_…

Original abstract

As autonomous agents increasingly execute end-to-end tasks under fixed monetary budgets, the pressing open question shifts from whether the budget is respected, to how to spend it effectively. Existing budget-aware methods typically control reasoning step-by-step within a single agent, or learn resource allocation policies via RL. None address how to split a budget across the composing phases of a multi-agent pipeline at inference time. We propose ZEBRA, a zero-shot framework that reduces multi-phase budget allocation to a continuous nonlinear knapsack problem: an LLM controller estimates per-phase utility curves, and a water-filling search on the Lagrange multiplier returns the per-phase split. Additive and multiplicative aggregations are unified under the same solver. On a $150$-task APPS coding benchmark, both ZEBRA variants outperform LLM-direct (budget allocation directly by an LLM) on every aggregate metric. At a budget of $\alpha = 0.5$ of the unconstrained spend, ZEBRA recovers $94.4\%$ of unconstrained quality, versus $88.1\%$ for LLM-direct. The advantage is statistically significant and transfers beyond coding: on a $3$-phase HotpotQA pipeline, ZEBRA beats LLM-direct by $14.3$pp, with allocations empirically robust to curve-estimation noise. On HotpotQA, ZEBRA arrives at a different budget split (near-balanced) compared to the APPS one (skewed towards a refinement phase), showing adaptation to the pipeline structure. More broadly, we show that lightweight algorithmic guidance at inference time can improve the economic behavior of autonomous multi-agent systems.
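
Why one solver can plausibly serve both aggregations: taking logarithms turns a product of phase qualities into a sum, which is again separable and concave. The sketch below assumes a plain weighted product of curves a(1 − e^(−bx)) raised to weights w; the paper's exact multiplicative objective, including any offset, is not reproduced on this page, so this form and the name `water_fill_product` are assumptions. Under this form the ceiling a drops out of the optimality condition, and each phase's spend at multiplier λ is ln(1 + w·b/λ)/b.

```python
import math

def water_fill_product(phases, budget, iters=200):
    """Maximize prod (1 - exp(-b * x)) ** w subject to sum(x) == budget.

    phases: list of (w, b) pairs with w > 0 and b > 0 (hypothetical inputs).
    Works on the log objective sum w * log(1 - exp(-b * x)).
    """
    def spend_at(lam):
        # Marginal of the log objective is w * b / (exp(b * x) - 1) = lam.
        return [math.log1p(w * b / lam) / b for w, b in phases]

    lo = hi = 1.0
    while sum(spend_at(lo)) < budget:   # find a multiplier that overspends
        lo /= 2.0
    while sum(spend_at(hi)) > budget:   # find a multiplier that underspends
        hi *= 2.0
    for _ in range(iters):              # bisect on a log scale
        mid = math.sqrt(lo * hi)
        if sum(spend_at(mid)) > budget:
            lo = mid
        else:
            hi = mid
    return spend_at(hi)

def log_objective(phases, xs):
    return sum(w * math.log(1 - math.exp(-b * x)) for (w, b), x in zip(phases, xs))

# Illustrative weights and saturation rates for a three-phase pipeline.
demo_phases = [(1.0, 100.0), (1.0, 40.0), (2.0, 20.0)]
demo_split = water_fill_product(demo_phases, budget=0.10)
```

Unlike the additive case, every phase receives a strictly positive share here, because a product collapses to zero if any single phase is starved.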

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note

private letter to a colleague

ZEBRA gives a clean zero-shot way to split budgets across multi-phase LLM pipelines via knapsack and water-filling on LLM curves, with solid reported gains, but the curve accuracy is not directly checked against real outcomes.

The main thing here is that ZEBRA reduces multi-phase budget allocation to a continuous nonlinear knapsack problem, solved by water-filling on per-phase utility curves that an LLM estimates from zero-shot prompts. It unifies additive and multiplicative quality measures under one solver and shows that the allocations adapt to the pipeline: skewed on APPS but more even on HotpotQA. On the APPS benchmark at half the unconstrained budget it recovers 94.4% of full quality versus 88.1% for direct LLM allocation, with the difference reported as statistically significant, and the lift transfers to HotpotQA at 14.3 percentage points. The zero-shot framing and the absence of any training step make it lightweight for inference-time use. That is the concrete advance over the step-by-step control and learned RL policies mentioned in the abstract. The experiments are straightforward, and the numbers are presented clearly enough to be worth trying to reproduce.

The soft spot is the unvalidated utility curves. The method assumes the LLM's zero-shot estimates are accurate enough for the solver to produce near-optimal splits, yet there is no reported comparison of predicted versus actual task quality at the chosen budgets. The paper notes robustness to curve noise but does not quantify prediction error on the real data, so it remains possible that the gains come from the prompting style or the search heuristic rather than from reliable modeling. If the full manuscript has additional checks on this point, they would tighten the central claim; otherwise it is a gap that revisions could close without changing the overall approach.

This paper is for applied researchers and engineers who need cost-aware orchestration of existing multi-agent LLM pipelines. A reader working on deployment economics or inference optimization would get immediate value from the algorithm and the two-benchmark results. It has a distinct enough framing and enough empirical support to deserve peer review, even if the curve validation needs more work.

Referee Report

2 major / 2 minor

Summary. The paper proposes ZEBRA, a zero-shot framework for splitting a fixed monetary budget across phases of a multi-agent LLM pipeline. An LLM controller estimates per-phase utility curves from zero-shot prompts; these curves are fed to a water-filling search over the Lagrange multiplier that solves the resulting continuous nonlinear knapsack problem and returns the per-phase allocation. Both additive and multiplicative aggregations are handled by the same solver. On a 150-task APPS benchmark, ZEBRA recovers 94.4% of unconstrained quality at α=0.5 budget versus 88.1% for direct LLM allocation, with the difference statistically significant; the method transfers to a 3-phase HotpotQA pipeline, yielding a 14.3 pp gain and a qualitatively different (near-balanced) split.

Significance. If the zero-shot utility estimates are reliable, ZEBRA supplies a training-free, inference-time algorithmic primitive that improves the economic behavior of autonomous multi-agent systems. The approach is parameter-free once the LLM controller is fixed, unifies aggregation types under one solver, and demonstrates task-adaptive allocations, all of which are concrete strengths for practical deployment.

major comments (2)

[§4 (Experiments)] APPS results at α=0.5: the central claim that the water-filling allocation is near-optimal rests on the untested premise that the LLM-estimated utility curves are sufficiently accurate. No direct comparison is reported between the predicted utilities and the empirical task qualities obtained when the pipeline is executed at the returned budgets; without this check the 94.4% vs 88.1% gap could arise from prompting artifacts rather than curve fidelity.
[§3 (Method)] Water-filling procedure: the robustness claim to curve-estimation noise is stated but not quantified with respect to the actual prediction error observed on the utility curves; a sensitivity plot or error-propagation analysis would be required to substantiate that the solver still yields near-optimal splits under realistic LLM estimation variance.

minor comments (2)

[§3] The notation for the per-phase utility function U_i(b) and the budget fraction α should be introduced with an explicit equation early in §3 to aid readers who are not already familiar with continuous knapsack formulations.
[Table 1] Table 1 (APPS aggregate metrics) would be clearer if it reported the number of independent runs and the exact statistical test used to establish significance for each metric.
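
As an illustration of the kind of test that could be named in such a table, here is a paired sign-flip permutation test on per-task scores. The paper's actual test is not stated on this page, so this is one reasonable option rather than the authors' procedure; the function name and the example scores are hypothetical.

```python
import random

def paired_permutation_pvalue(scores_a, scores_b, n_resamples=10000, seed=0):
    """Two-sided sign-flip permutation test on per-task score differences."""
    diffs = [x - y for x, y in zip(scores_a, scores_b)]
    observed = abs(sum(diffs))
    rng = random.Random(seed)
    hits = 0
    for _ in range(n_resamples):
        # Under the null, each task's difference is equally likely to flip sign.
        flipped = sum(d if rng.random() < 0.5 else -d for d in diffs)
        if abs(flipped) >= observed:
            hits += 1
    # The +1 correction keeps the estimate away from an impossible p = 0.
    return (hits + 1) / (n_resamples + 1)

# Hypothetical per-task scores for two allocation strategies on 12 tasks.
method_a = [0.9, 0.8, 0.85, 0.7, 0.95, 0.75, 0.88, 0.92, 0.81, 0.79, 0.9, 0.86]
method_b = [s - 0.1 for s in method_a]
```

Pairing by task matters here because task difficulty dominates the variance; an unpaired test would discard that structure.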

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive and detailed feedback. We address each major comment point-by-point below, providing our strongest honest defense while committing to revisions that strengthen the manuscript without misrepresenting our results.

read point-by-point responses

Referee: [§4 (Experiments)] APPS results at α=0.5: the central claim that the water-filling allocation is near-optimal rests on the untested premise that the LLM-estimated utility curves are sufficiently accurate. No direct comparison is reported between the predicted utilities and the empirical task qualities obtained when the pipeline is executed at the returned budgets; without this check the 94.4% vs 88.1% gap could arise from prompting artifacts rather than curve fidelity.

Authors: We acknowledge the validity of this observation: a direct side-by-side validation of predicted utilities versus realized empirical qualities at the ZEBRA-allocated budgets is not present in the current manuscript and would provide clearer evidence that the performance gains stem from curve fidelity rather than prompting effects. While the statistically significant improvement over LLM-direct (which employs comparable prompting) offers indirect support, we agree this does not fully substitute for the requested check. In the revision we will add an analysis (e.g., a table or scatter plot) comparing LLM-estimated utilities to actual task qualities obtained by executing the pipeline at the returned per-phase budgets on the APPS benchmark.
revision: yes

Referee: [§3 (Method)] Water-filling procedure: the robustness claim to curve-estimation noise is stated but not quantified with respect to the actual prediction error observed on the utility curves; a sensitivity plot or error-propagation analysis would be required to substantiate that the solver still yields near-optimal splits under realistic LLM estimation variance.

Authors: The manuscript does report that allocations remain empirically robust to curve-estimation noise on the HotpotQA transfer experiment, and the water-filling solver is designed to be stable under monotonic utility curves. Nevertheless, we agree that this robustness has not been quantified against the specific magnitude of prediction error observed in our utility-curve estimates, nor accompanied by a sensitivity or error-propagation study. To address the referee’s request we will include, in the revised manuscript, a sensitivity plot that injects controlled noise matching the observed LLM estimation variance and reports the resulting variation in allocation and end-to-end quality.
revision: yes
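
A noise-injection study of the kind discussed in this exchange could be prototyped roughly as follows. The sketch assumes additive aggregation, curves of the form a(1 − e^(−bx)), and multiplicative log-normal noise on both parameters. The noise model, the function names, and all numeric values are assumptions for illustration, not the protocol the authors commit to.

```python
import math
import random

def water_fill(curves, budget, iters=100):
    """Additive water-filling for curves a * (1 - exp(-b * x))."""
    def spend_at(lam):
        return [max(0.0, math.log(a * b / lam) / b) for a, b in curves]
    hi = max(a * b for a, b in curves)
    lo = hi
    while sum(spend_at(lo)) < budget:
        lo /= 2.0
    for _ in range(iters):
        mid = math.sqrt(lo * hi)
        if sum(spend_at(mid)) > budget:
            lo = mid
        else:
            hi = mid
    return spend_at(hi)

def quality(curves, xs):
    return sum(a * (1 - math.exp(-b * x)) for (a, b), x in zip(curves, xs))

def noise_retention(true_curves, budget, sigma, trials=200, seed=0):
    """Mean quality of noisy-curve allocations relative to the oracle allocation."""
    rng = random.Random(seed)
    best = quality(true_curves, water_fill(true_curves, budget))
    total = 0.0
    for _ in range(trials):
        # Perturb each parameter by an independent log-normal factor.
        noisy = [(a * math.exp(rng.gauss(0.0, sigma)),
                  b * math.exp(rng.gauss(0.0, sigma))) for a, b in true_curves]
        # Allocate with the noisy curves, but score with the true ones.
        total += quality(true_curves, water_fill(noisy, budget))
    return total / trials / best

# Hypothetical "true" curves for a four-phase pipeline.
true_curves = [(0.9, 100.0), (0.8, 60.0), (0.95, 40.0), (1.0, 20.0)]
retention_clean = noise_retention(true_curves, 0.10, sigma=0.0)
retention_noisy = noise_retention(true_curves, 0.10, sigma=0.3)
```

Sweeping `sigma` and plotting the returned ratio would give the sensitivity curve; matching `sigma` to the measured estimation error is the part that needs real data.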

Circularity Check

0 steps flagged

No significant circularity detected

The paper presents ZEBRA as a zero-shot algorithmic method that reduces budget allocation to a nonlinear knapsack problem solved via water-filling on a Lagrange multiplier after an LLM estimates per-phase utility curves. The reported performance numbers (e.g., 94.4% recovery at α=0.5 on APPS) are measured outcomes from applying this procedure to benchmarks and comparing against LLM-direct; they are not used to define, fit, or construct the allocation rule itself. No equation or step equates a claimed result to its own inputs by construction, and no load-bearing premise reduces to a self-citation or fitted parameter. The derivation chain from problem statement to solver is self-contained and independent of the final empirical metrics.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The paper introduces no new physical entities or forces. The main modeling choice is the assumption that per-phase utility can be represented as continuous curves estimable by an LLM; this is treated as a domain assumption rather than a fitted parameter. No free parameters are explicitly introduced in the abstract description of the solver.

axioms (1)

domain assumption: An LLM can produce usable estimates of per-phase utility curves from zero-shot prompting.
Invoked when the framework states that the LLM controller estimates the curves before the water-filling search is run.

Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

IndisputableMonolith/Cost/FunctionalEquation.lean · washburn_uniqueness_aczel
Relation between the paper passage and the cited Recognition theorem: unclear
Paper passage: "an LLM controller estimates per-phase utility curves, and a water-filling search on the Lagrange multiplier returns the per-phase split"

IndisputableMonolith/Foundation/AlphaCoordinateFixation.lean · costAlphaLog_fourth_deriv_at_zero
Relation between the paper passage and the cited Recognition theorem: unclear
Paper passage: "$f_i(x) = a_i(1 - e^{-b_i x})$ … $f_i'(x_i) = \lambda$"

What do these tags mean?

matches: The paper's claim is directly supported by a theorem in the formal canon.
supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses: The paper appears to rely on the theorem as machinery.
contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Reference graph

Works this paper leans on

11 extracted references · 11 canonical work pages · 1 internal anchor

[1]

URL https://www.cs.toronto.edu/~cebly/Papers/DBMDPs_uai.pdf. S. Boyd and L. Vandenberghe. Convex Optimization. Cambridge University Press, 2004. K. Brown, A. Muppidi, and R. Shahout. Predictive scheduling for efficient inference-time reasoning in large language models. In ES-FoMo III: Workshop on Efficient Systems for Foundation Models (ICML 2025), 2025. URL...

[2]

HotpotQA: A Dataset for Diverse, Explainable Multi-hop Question Answering

URL https://arxiv.org/abs/1809.09600. S. Yegge and contributors. Gas Town - multi-agent workspace manager, 2026. URL https://github.com/steveyegge/gastown. Accessed: 2026-03-14. M. Zhao, Q. Qi, and H. Sun. ROI-reasoning: Rational optimization for inference via pre-computation meta-cognition, 2026. URL https://arxiv.org/abs/2601.03822. A Extended Related...

[6]

plan": {{

refine – review/revise loop using gpt-4o-2024-08-06 (∼17× more expensive per call than gpt-4o-mini)
Carefully consider the specific task above and its difficulty. For each phase, estimate the following parameters for THIS particular task:
- tokens_basic (integer, 100–10000): total output tokens this phase needs to produce basically acceptable output (∼50% ...

[10]

plan": ...,

refine – review→revise loop that catches and fixes real bugs. Each iteration = 2 LLM calls. Uses gpt-4o-2024-08-06 (∼17× cost per call vs gpt-4o-mini-2024-07-18). Token-heavy.
Allocate the total budget across these 4 phases. Output ONLY a JSON object mapping each phase to its USD allocation, with no extra text.
(c) LLM-CoT allocator prompt. Identical to ...

[11]

plan – understand the task and create a plan


[12]

decompose – break the plan into implementable tasks

[13]

implement – write the solution code

[14]

plan": {

refine – review/revise loop to fix bugs
We have estimated the following utility curves for each phase. Each curve models quality = a·(1 − e^(−b·budget)), where:
- a (quality ceiling): maximum quality this phase can achieve
- b (saturation rate): how quickly returns diminish per dollar spent
- Higher b means the phase saturates quickly (needs less budget)
- High...

[15]

Implement frequently produces a near-correct but edge-case-buggy solution (the always-add-+1 shortcut, the missing ‘S’ not in trophies branch, etc.). All five strategies’ implement phases share a similar bug rate – mean implement-only score is 0.27–0.31 across LLM, LLM-CoT, ZEBRA-LLM, and mult_offset; additive is slightly higher at 0.41 thanks to a longe...

[16]

Refine can fix these bugs – review correctly identifies the edge case on every refine-fire seed we audited – but only if the revise call has enough budget to actually finish writing the corrected function. With $7–9×10⁻³ on refine, gpt-4o exhausts its per-call output-token budget mid-revision and the wrapper logs [revise] Output missing END_OF_OUTPUT token ....

[17]

LLM-based allocators sit in the first regime (LLM-Direct $9.08±0.62; LLM-CoT $6.68±1.91; ZEBRA-LLM ablation $7.28±1.89)

ZEBRA’s allocators sit in the second regime on every seed (additive at $13.03±0.24, mult_offset at $13.14±0.24, both essentially deterministic across seeds). LLM-based allocators sit in the first regime (LLM-Direct $9.08±0.62; LLM-CoT $6.68±1.91; ZEBRA-LLM ablation $7.28±1.89). The ZEBRA-LLM ablation in particular shows that the win is from the algorith...
