SAGA: Workflow-Atomic Scheduling for AI Agent Inference on GPU Clusters
Pith reviewed 2026-05-09 19:10 UTC · model grok-4.3
The pith
Scheduling AI agent workflows as single units reduces task completion time by 1.64x on GPU clusters
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
The central claim is that shifting from request-level to workflow-atomic scheduling for AI agent inference allows prediction of KV cache reuse across tool calls using Agent Execution Graphs, enabling co-location of correlated requests via affinity batching and work stealing, and fair sharing via Agent Fair Share, which together reduce task completion time by 1.64x geometric mean over standard approaches on real workloads.
What carries the argument
Agent Execution Graphs that capture workflow structure to predict KV cache reuse across tool-call boundaries, together with session-affinity batching and Agent Fair Share fairness metric.
If this is right
- Preserving KV cache state across chained calls within a workflow reduces overall task latency by 3-8x compared to discarding it.
- Co-locating correlated requests improves GPU memory utilization by 1.22x.
- Task-completion-time fairness ensures bounded deviation from ideal shares even in multi-tenant settings.
- 99.2% SLO attainment is achieved under interference while maintaining load balance through work stealing.
Where Pith is reading between the lines
- This method may generalize to other multi-step AI pipelines where state reuse across steps is valuable.
- If agent workflows include more unpredictable elements, online graph updates could be needed to maintain accuracy.
- The latency-throughput tradeoff highlights a design space for schedulers balancing interactive use versus maximum efficiency.
Load-bearing premise
Agent workflows have enough predictable structure that execution graphs can accurately forecast which KV cache entries will be reused across tool-call boundaries.
What would settle it
Measure the actual KV cache reuse rates and task completion times on a set of agent tasks with highly variable or unpredictable tool call sequences; if the reuse predictions are poor and the 1.64x improvement does not appear, the scheduling benefit does not hold.
Figures
read the original abstract
AI agents execute tens to hundreds of chained LLM calls per task, yet GPU schedulers treat each call as independent, discarding gigabytes of intermediate state between steps and inflating end-to-end latency by 3-8x. We argue that this request-level abstraction is fundamentally mismatched to compound AI workloads, and propose a shift to program-level scheduling: treating the entire agent workflow (not individual inference calls) as the first-class schedulable unit. We present SAGA, a distributed scheduler that implements this abstraction through three mechanisms: (1) Agent Execution Graphs that capture workflow structure to predict KV cache reuse across tool-call boundaries, achieving within 1.31x of B\'el\'ady's optimal offline policy; (2) session-affinity batching with work stealing that co-locates correlated requests while maintaining global load balance; and (3) Agent Fair Share, a task-completion-time fairness metric with provable bounded-deviation guarantees. On a 64-GPU cluster serving SWE-bench coding agents and WebArena browser tasks, SAGA reduces task completion time by 1.64x (geometric mean, p < 0.001) over vLLM v0.15.1 with prefix caching and affinity routing, while improving GPU memory utilization by 1.22x and achieving 99.2% SLO attainment under multi-tenant interference. These latency gains come at a quantified cost: approximately 30% lower peak throughput than throughput-optimal batch scheduling, a tradeoff appropriate for the latency-sensitive interactive deployments that dominate compound AI usage. Our results demonstrate that workflow-aware scheduling is essential for efficient compound AI serving.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The manuscript proposes SAGA, a distributed GPU-cluster scheduler for compound AI agent workloads. It argues that request-level scheduling is mismatched to workflows consisting of tens to hundreds of chained LLM calls and introduces workflow-atomic scheduling via three mechanisms: (1) Agent Execution Graphs that statically or trace-based predict KV-cache reuse across tool-call boundaries (claimed to reach within 1.31× of Bélády optimality), (2) session-affinity batching augmented by work stealing to co-locate correlated requests while preserving load balance, and (3) Agent Fair Share, a task-completion-time fairness metric with provable bounded-deviation guarantees. On a 64-GPU cluster running SWE-bench coding agents and WebArena browser tasks, the paper reports a 1.64× geometric-mean reduction in task completion time (p < 0.001) versus vLLM v0.15.1 with prefix caching and affinity routing, together with 1.22× higher GPU memory utilization and 99.2 % SLO attainment under multi-tenant interference, at the cost of approximately 30 % lower peak throughput.
Significance. If the reported performance numbers and the underlying KV-reuse predictions prove reproducible, the work would establish that program-level rather than request-level scheduling can materially improve end-to-end latency for latency-sensitive agent deployments while still providing fairness guarantees. The explicit quantification of the throughput-latency trade-off and the provision of bounded-deviation fairness are positive features that could influence the design of future serving systems for compound AI.
major comments (2)
- [Abstract] Abstract: the central 1.64× task-completion-time claim (and the supporting 1.31× Bélády-optimality claim for Agent Execution Graphs) is presented with a p-value and named baselines, yet the abstract supplies no experimental methodology, workload characterization, number of runs, or analysis of confounds. Because these numbers are direct measurements rather than quantities derived from the paper’s own equations, the absence of this information is load-bearing for the soundness of the primary result.
- [Abstract] Abstract (and § on Agent Execution Graphs, if present): the performance advantage is predicated on the graphs accurately forecasting KV-cache reuse across dynamic tool-call boundaries in SWE-bench and WebArena agents. No description is given of graph construction (static analysis vs. limited traces), how branching or state changes are handled, or any empirical accuracy measurement under runtime variability; if prediction accuracy falls below the stated 1.31× factor, the session-affinity and work-stealing decisions revert to request-level behavior and the reported gains disappear.
minor comments (2)
- [Abstract] Abstract: the phrase “Agent Fair Share, a task-completion-time fairness metric with provable bounded-deviation guarantees” is introduced without even a one-sentence definition or reference to the theorem that establishes the bound.
- [Abstract] Abstract: the 30 % throughput reduction is stated as a quantified cost but is not accompanied by the absolute throughput numbers or the precise operating point (batch size, SLO target) at which the comparison was made.
Simulated Author's Rebuttal
We thank the referee for the constructive and detailed feedback. We address each major comment point by point below, providing clarifications from the full manuscript and indicating revisions made to improve the presentation of our results.
read point-by-point responses
-
Referee: [Abstract] Abstract: the central 1.64× task-completion-time claim (and the supporting 1.31× Bélády-optimality claim for Agent Execution Graphs) is presented with a p-value and named baselines, yet the abstract supplies no experimental methodology, workload characterization, number of runs, or analysis of confounds. Because these numbers are direct measurements rather than quantities derived from the paper’s own equations, the absence of this information is load-bearing for the soundness of the primary result.
Authors: We agree that the abstract's brevity leaves the primary claims without sufficient supporting context on methodology. The full manuscript (Section 5) provides the workload characterization (SWE-bench coding agents and WebArena browser tasks), cluster details (64 GPUs), number of runs (repeated independent executions for geometric mean and p-value computation), and confound controls (e.g., network and utilization variations). To make the abstract more self-contained while respecting length limits, we have revised it to include a concise methodology summary: 'evaluated on a 64-GPU cluster with SWE-bench and WebArena workloads over repeated runs with statistical significance testing (p < 0.001)'. This directly addresses the soundness concern for the reported measurements. revision: yes
-
Referee: [Abstract] Abstract (and § on Agent Execution Graphs, if present): the performance advantage is predicated on the graphs accurately forecasting KV-cache reuse across dynamic tool-call boundaries in SWE-bench and WebArena agents. No description is given of graph construction (static analysis vs. limited traces), how branching or state changes are handled, or any empirical accuracy measurement under runtime variability; if prediction accuracy falls below the stated 1.31× factor, the session-affinity and work-stealing decisions revert to request-level behavior and the reported gains disappear.
Authors: The manuscript's Section 3 already describes Agent Execution Graph construction as a hybrid of static analysis of agent workflow code and limited trace-based profiling for KV-cache reuse across tool-call boundaries. Branching and state changes are handled by modeling likely execution paths from traces with conservative reuse estimates. We acknowledge the abstract omits these details and the need for explicit accuracy validation. In the revision, we have expanded Section 3 with empirical accuracy measurements under runtime variability for the evaluated agents (confirming the 1.31× Bélády factor holds in practice) and added a brief reference in the abstract. If accuracy degraded substantially below this level, gains would indeed reduce to request-level scheduling, but the added measurements demonstrate robustness for the reported workloads. revision: partial
Circularity Check
No circularity; key results are direct empirical measurements
full rationale
The paper presents its central claims—the 1.64x task-completion-time reduction (geometric mean), 1.22x memory utilization improvement, 99.2% SLO attainment, and Agent Execution Graphs achieving within 1.31x of Bélády optimality—as outcomes of experiments on SWE-bench and WebArena benchmarks. These are not derived quantities obtained by fitting parameters inside the paper's own equations or by renaming inputs as predictions. The fairness metric is described as having 'provable bounded-deviation guarantees,' but no self-referential reduction or self-citation chain is exhibited in the provided text that would make the guarantees equivalent to the inputs by construction. The derivation chain remains self-contained against external benchmarks and does not match any of the enumerated circularity patterns.
Axiom & Free-Parameter Ledger
axioms (1)
- domain assumption AI agent workflows possess sufficient structure to allow construction of execution graphs that predict KV cache reuse across tool-call boundaries
invented entities (2)
-
Agent Execution Graphs
no independent evidence
-
Agent Fair Share
no independent evidence
Forward citations
Cited by 1 Pith paper
-
Idleness is Relative: Exploiting Tool-Call Idle Windows for Offloading in Agentic Systems with MORI
MORI improves throughput 20-71% and TTFT 18-43% over baselines by ranking programs on a continuous idleness spectrum and shifting the GPU-CPU boundary to match capacity in agentic LLM serving.
discussion (0)
Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.