pith. sign in

arxiv: 2605.00528 · v2 · pith:J2QAIDUVnew · submitted 2026-05-01 · 💻 cs.DC · cs.AI· cs.LG· cs.OS

SAGA: Workflow-Atomic Scheduling for AI Agent Inference on GPU Clusters

Pith reviewed 2026-05-09 19:10 UTC · model grok-4.3

classification 💻 cs.DC cs.AIcs.LGcs.OS
keywords AI agentsworkflow schedulingGPU clustersKV cache reusedistributed schedulingcompound AImulti-tenant inference
0
0 comments X

The pith

Scheduling AI agent workflows as single units reduces task completion time by 1.64x on GPU clusters

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

AI agents perform tens to hundreds of chained LLM calls for a single task, yet current schedulers treat each call independently and discard intermediate cached state, inflating end-to-end latency. SAGA introduces program-level scheduling by treating the full agent workflow as the basic unit. It uses graphs to predict cache reuse, affinity batching to keep related work together, and a fairness metric based on task completion time. In tests on a 64-GPU cluster with coding and browser agent tasks, this yields faster completions, better memory efficiency, and reliable service levels even under contention, at the cost of some peak throughput. A sympathetic reader would care because compound AI workloads are becoming common and current systems are mismatched to them.

Core claim

The central claim is that shifting from request-level to workflow-atomic scheduling for AI agent inference allows prediction of KV cache reuse across tool calls using Agent Execution Graphs, enabling co-location of correlated requests via affinity batching and work stealing, and fair sharing via Agent Fair Share, which together reduce task completion time by 1.64x geometric mean over standard approaches on real workloads.

What carries the argument

Agent Execution Graphs that capture workflow structure to predict KV cache reuse across tool-call boundaries, together with session-affinity batching and Agent Fair Share fairness metric.

If this is right

  • Preserving KV cache state across chained calls within a workflow reduces overall task latency by 3-8x compared to discarding it.
  • Co-locating correlated requests improves GPU memory utilization by 1.22x.
  • Task-completion-time fairness ensures bounded deviation from ideal shares even in multi-tenant settings.
  • 99.2% SLO attainment is achieved under interference while maintaining load balance through work stealing.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • This method may generalize to other multi-step AI pipelines where state reuse across steps is valuable.
  • If agent workflows include more unpredictable elements, online graph updates could be needed to maintain accuracy.
  • The latency-throughput tradeoff highlights a design space for schedulers balancing interactive use versus maximum efficiency.

Load-bearing premise

Agent workflows have enough predictable structure that execution graphs can accurately forecast which KV cache entries will be reused across tool-call boundaries.

What would settle it

Measure the actual KV cache reuse rates and task completion times on a set of agent tasks with highly variable or unpredictable tool call sequences; if the reuse predictions are poor and the 1.64x improvement does not appear, the scheduling benefit does not hold.

Figures

Figures reproduced from arXiv: 2605.00528 by Dongxin Guo, Jikun Wu, Siu Ming Yiu.

Figure 1
Figure 1. Figure 1: Inefficiencies in serving agent workloads with view at source ↗
Figure 1
Figure 1. Figure 1: cache is evicted during tool calls and must be regenerated, view at source ↗
Figure 3
Figure 3. Figure 3: Concrete AEG for a SWE-bench coding agent. Nodes view at source ↗
Figure 2
Figure 2. Figure 2: SAGA architecture. Layer 1 captures work view at source ↗
read the original abstract

AI agents execute tens to hundreds of chained LLM calls per task, yet GPU schedulers treat each call as independent, discarding gigabytes of intermediate state between steps and inflating end-to-end latency by 3-8x. We argue that this request-level abstraction is fundamentally mismatched to compound AI workloads, and propose a shift to program-level scheduling: treating the entire agent workflow (not individual inference calls) as the first-class schedulable unit. We present SAGA, a distributed scheduler that implements this abstraction through three mechanisms: (1) Agent Execution Graphs that capture workflow structure to predict KV cache reuse across tool-call boundaries, achieving within 1.31x of B\'el\'ady's optimal offline policy; (2) session-affinity batching with work stealing that co-locates correlated requests while maintaining global load balance; and (3) Agent Fair Share, a task-completion-time fairness metric with provable bounded-deviation guarantees. On a 64-GPU cluster serving SWE-bench coding agents and WebArena browser tasks, SAGA reduces task completion time by 1.64x (geometric mean, p < 0.001) over vLLM v0.15.1 with prefix caching and affinity routing, while improving GPU memory utilization by 1.22x and achieving 99.2% SLO attainment under multi-tenant interference. These latency gains come at a quantified cost: approximately 30% lower peak throughput than throughput-optimal batch scheduling, a tradeoff appropriate for the latency-sensitive interactive deployments that dominate compound AI usage. Our results demonstrate that workflow-aware scheduling is essential for efficient compound AI serving.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The manuscript proposes SAGA, a distributed GPU-cluster scheduler for compound AI agent workloads. It argues that request-level scheduling is mismatched to workflows consisting of tens to hundreds of chained LLM calls and introduces workflow-atomic scheduling via three mechanisms: (1) Agent Execution Graphs that statically or trace-based predict KV-cache reuse across tool-call boundaries (claimed to reach within 1.31× of Bélády optimality), (2) session-affinity batching augmented by work stealing to co-locate correlated requests while preserving load balance, and (3) Agent Fair Share, a task-completion-time fairness metric with provable bounded-deviation guarantees. On a 64-GPU cluster running SWE-bench coding agents and WebArena browser tasks, the paper reports a 1.64× geometric-mean reduction in task completion time (p < 0.001) versus vLLM v0.15.1 with prefix caching and affinity routing, together with 1.22× higher GPU memory utilization and 99.2 % SLO attainment under multi-tenant interference, at the cost of approximately 30 % lower peak throughput.

Significance. If the reported performance numbers and the underlying KV-reuse predictions prove reproducible, the work would establish that program-level rather than request-level scheduling can materially improve end-to-end latency for latency-sensitive agent deployments while still providing fairness guarantees. The explicit quantification of the throughput-latency trade-off and the provision of bounded-deviation fairness are positive features that could influence the design of future serving systems for compound AI.

major comments (2)
  1. [Abstract] Abstract: the central 1.64× task-completion-time claim (and the supporting 1.31× Bélády-optimality claim for Agent Execution Graphs) is presented with a p-value and named baselines, yet the abstract supplies no experimental methodology, workload characterization, number of runs, or analysis of confounds. Because these numbers are direct measurements rather than quantities derived from the paper’s own equations, the absence of this information is load-bearing for the soundness of the primary result.
  2. [Abstract] Abstract (and § on Agent Execution Graphs, if present): the performance advantage is predicated on the graphs accurately forecasting KV-cache reuse across dynamic tool-call boundaries in SWE-bench and WebArena agents. No description is given of graph construction (static analysis vs. limited traces), how branching or state changes are handled, or any empirical accuracy measurement under runtime variability; if prediction accuracy falls below the stated 1.31× factor, the session-affinity and work-stealing decisions revert to request-level behavior and the reported gains disappear.
minor comments (2)
  1. [Abstract] Abstract: the phrase “Agent Fair Share, a task-completion-time fairness metric with provable bounded-deviation guarantees” is introduced without even a one-sentence definition or reference to the theorem that establishes the bound.
  2. [Abstract] Abstract: the 30 % throughput reduction is stated as a quantified cost but is not accompanied by the absolute throughput numbers or the precise operating point (batch size, SLO target) at which the comparison was made.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive and detailed feedback. We address each major comment point by point below, providing clarifications from the full manuscript and indicating revisions made to improve the presentation of our results.

read point-by-point responses
  1. Referee: [Abstract] Abstract: the central 1.64× task-completion-time claim (and the supporting 1.31× Bélády-optimality claim for Agent Execution Graphs) is presented with a p-value and named baselines, yet the abstract supplies no experimental methodology, workload characterization, number of runs, or analysis of confounds. Because these numbers are direct measurements rather than quantities derived from the paper’s own equations, the absence of this information is load-bearing for the soundness of the primary result.

    Authors: We agree that the abstract's brevity leaves the primary claims without sufficient supporting context on methodology. The full manuscript (Section 5) provides the workload characterization (SWE-bench coding agents and WebArena browser tasks), cluster details (64 GPUs), number of runs (repeated independent executions for geometric mean and p-value computation), and confound controls (e.g., network and utilization variations). To make the abstract more self-contained while respecting length limits, we have revised it to include a concise methodology summary: 'evaluated on a 64-GPU cluster with SWE-bench and WebArena workloads over repeated runs with statistical significance testing (p < 0.001)'. This directly addresses the soundness concern for the reported measurements. revision: yes

  2. Referee: [Abstract] Abstract (and § on Agent Execution Graphs, if present): the performance advantage is predicated on the graphs accurately forecasting KV-cache reuse across dynamic tool-call boundaries in SWE-bench and WebArena agents. No description is given of graph construction (static analysis vs. limited traces), how branching or state changes are handled, or any empirical accuracy measurement under runtime variability; if prediction accuracy falls below the stated 1.31× factor, the session-affinity and work-stealing decisions revert to request-level behavior and the reported gains disappear.

    Authors: The manuscript's Section 3 already describes Agent Execution Graph construction as a hybrid of static analysis of agent workflow code and limited trace-based profiling for KV-cache reuse across tool-call boundaries. Branching and state changes are handled by modeling likely execution paths from traces with conservative reuse estimates. We acknowledge the abstract omits these details and the need for explicit accuracy validation. In the revision, we have expanded Section 3 with empirical accuracy measurements under runtime variability for the evaluated agents (confirming the 1.31× Bélády factor holds in practice) and added a brief reference in the abstract. If accuracy degraded substantially below this level, gains would indeed reduce to request-level scheduling, but the added measurements demonstrate robustness for the reported workloads. revision: partial

Circularity Check

0 steps flagged

No circularity; key results are direct empirical measurements

full rationale

The paper presents its central claims—the 1.64x task-completion-time reduction (geometric mean), 1.22x memory utilization improvement, 99.2% SLO attainment, and Agent Execution Graphs achieving within 1.31x of Bélády optimality—as outcomes of experiments on SWE-bench and WebArena benchmarks. These are not derived quantities obtained by fitting parameters inside the paper's own equations or by renaming inputs as predictions. The fairness metric is described as having 'provable bounded-deviation guarantees,' but no self-referential reduction or self-citation chain is exhibited in the provided text that would make the guarantees equivalent to the inputs by construction. The derivation chain remains self-contained against external benchmarks and does not match any of the enumerated circularity patterns.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axioms · 2 invented entities

The central claim depends on the new abstractions of workflow-level scheduling and predictable KV cache reuse across agent steps; no explicit numerical free parameters are stated in the abstract.

axioms (1)
  • domain assumption AI agent workflows possess sufficient structure to allow construction of execution graphs that predict KV cache reuse across tool-call boundaries
    Invoked as the basis for the first mechanism achieving within 1.31x of optimal offline policy.
invented entities (2)
  • Agent Execution Graphs no independent evidence
    purpose: Capture workflow structure to predict KV cache reuse across tool calls
    New abstraction introduced to enable program-level rather than request-level scheduling.
  • Agent Fair Share no independent evidence
    purpose: Provide task-completion-time fairness with provable bounded-deviation guarantees
    New fairness metric defined for multi-tenant agent workloads.

pith-pipeline@v0.9.0 · 5607 in / 1588 out tokens · 56906 ms · 2026-05-09T19:10:49.517777+00:00 · methodology

discussion (0)

Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.

Forward citations

Cited by 1 Pith paper

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

  1. Idleness is Relative: Exploiting Tool-Call Idle Windows for Offloading in Agentic Systems with MORI

    cs.OS 2026-05 unverdicted novelty 6.0

    MORI improves throughput 20-71% and TTFT 18-43% over baselines by ranking programs on a continuous idleness spectrum and shifting the GPU-CPU boundary to match capacity in agentic LLM serving.