Not All Prefills Are Equal: PPD Disaggregation for Multi-turn LLM Serving
Pith reviewed 2026-05-15 14:14 UTC · model grok-4.3
The pith
Dynamically routing append-prefills to decode nodes reduces Turn 2+ time-to-first-token (TTFT) by about 68 percent.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
Prefill Prefill-capable Decode (PPD) disaggregation is a dynamic routing system that decides when to process Turn 2+ requests locally on decode nodes using cached KV states. It adapts to varying service-level objectives (SLOs) via configurable weights and integrates with existing Prefill-Decode (PD) deployments. The result is a reduction of Turn 2+ TTFT by approximately 68 percent while keeping time-per-output-token (TPOT) competitive and easing KV-transfer congestion under load.
What carries the argument
PPD disaggregation is a dynamic routing mechanism that can send append-prefill operations to decode nodes, where they reuse cached KV states and avoid cross-node transfers. Configurable weights tune the routing decision to different SLOs.
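The routing decision can be pictured as a weighted cost comparison. This is a minimal sketch, not the paper's algorithm: the scoring function, the names `RouteWeights`, `Request`, and `route`, and the latency estimates passed in are all assumptions made for illustration, since this page does not give the actual formula.

```python
from dataclasses import dataclass


@dataclass
class RouteWeights:
    # Hypothetical knobs: relative importance of TTFT versus TPOT.
    w_ttft: float = 0.5
    w_tpot: float = 0.5


@dataclass
class Request:
    turn: int                  # 1 for the first turn, 2+ for follow-ups
    kv_cached_on_decode: bool  # decode node still holds this session's KV cache


def route(req, local_ttft_ms, remote_ttft_ms, local_tpot_penalty_ms, weights):
    """Return "decode" to append-prefill locally, or "prefill" to use a prefill node."""
    # Without a reusable cache there is nothing to append to.
    if req.turn < 2 or not req.kv_cached_on_decode:
        return "prefill"
    # Local work pays a TTFT cost plus interference with ongoing decodes.
    local_cost = weights.w_ttft * local_ttft_ms + weights.w_tpot * local_tpot_penalty_ms
    # Remote work pays prefill plus KV-transfer time, folded into remote_ttft_ms.
    remote_cost = weights.w_ttft * remote_ttft_ms
    return "decode" if local_cost <= remote_cost else "prefill"
```

With equal weights and illustrative estimates of 40 ms local TTFT, 120 ms remote TTFT, and a 30 ms decode penalty, the sketch keeps the request on the decode node. Shifting the weights toward TPOT (0.1 versus 0.9) sends the same request to a prefill node, which is the kind of SLO-dependent flip a fixed policy cannot express.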
Load-bearing premise
Append-prefill incurs an order-of-magnitude smaller decoding slowdown than full prefill, and no single fixed routing strategy can meet all service-level objectives at once.
What would settle it
A direct measurement showing that append-prefill causes decoding slowdown comparable to full prefill, or a static routing policy that simultaneously meets all tested SLOs without dynamic weight adjustment.
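One way to frame the first test is as a ratio of relative TPOT slowdowns. The sketch below assumes TPOT samples are collected for decoding alone, decoding alongside full prefill, and decoding alongside append-prefill. The function names and the factor of 10 used for "order of magnitude" are assumptions, not definitions from the paper.

```python
from statistics import mean


def decoding_slowdown(baseline_tpot_ms, interfered_tpot_ms):
    """Relative TPOT increase caused by co-located prefill work."""
    return mean(interfered_tpot_ms) / mean(baseline_tpot_ms) - 1.0


def premise_holds(baseline, with_full, with_append, factor=10.0):
    """True if full prefill slows decoding at least `factor` times more than append-prefill."""
    full = decoding_slowdown(baseline, with_full)
    append = decoding_slowdown(baseline, with_append)
    if append <= 0:
        # Append-prefill caused no measurable slowdown at all.
        return full > 0
    return full / append >= factor
```

A measured ratio near 1 would undercut the premise, while a ratio of 10 or more would support it.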
Original abstract
Prefill-Decode (PD) disaggregation has become the standard architecture for modern LLM inference engines, which alleviates the interference of two distinctive workloads. With the growing demand for multi-turn interactions in chatbots and agentic systems, we re-examined PD in this case and found two fundamental inefficiencies: (1) every turn requires prefilling the new prompt and response from the last turn, and (2) repeated KV transfers between prefill and decode nodes saturate the bandwidth, leading to high latency and even service degradation. Our key insight is that not all prefill operations are equally disruptive: append-prefill, which processes only the new input tokens while reusing cached KV states, incurs an order-of-magnitude smaller decoding slowdown than full prefill. This motivates routing append-prefill to decode nodes locally. However, through comprehensive analysis, we show that no single fixed routing strategy satisfies all Service Level Objectives (SLOs) simultaneously. Based on this insight, we propose Prefill Prefill-capable Decode (PPD) disaggregation, a dynamic routing system that decides when to process Turn 2+ requests locally on decode nodes using cached KV states. PPD adapts to varying SLOs via configurable weights and seamlessly integrates with traditional PD deployments. With extensive evaluations, we show that PPD reduces Turn 2+ time-to-first-token (TTFT) by ~68% while maintaining competitive time-per-output-token (TPOT), effectively alleviating KV transfer congestion under high load. PPD provides a flexible and efficient paradigm for multi-turn LLM serving.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The manuscript proposes PPD (Prefill Prefill-capable Decode) disaggregation for multi-turn LLM serving. Building on standard Prefill-Decode (PD) architectures, it identifies two inefficiencies: every turn must prefill the new prompt together with the previous turn's response, and repeated KV-cache transfers between prefill and decode nodes saturate bandwidth. The core insight is that append-prefill (processing only new tokens while reusing cached KV states) incurs far less decoding slowdown than full prefill. This motivates routing append-prefill locally on decode nodes. The paper argues that no single fixed routing policy meets all SLOs simultaneously, and therefore introduces dynamic routing with configurable weights that adapts to varying SLO targets. The reported evaluations show a ~68% reduction in Turn 2+ TTFT while preserving competitive TPOT and relieving KV-transfer pressure under high load.
Significance. If the performance claims are substantiated with proper fixed-policy baselines, the work offers a pragmatic extension to existing PD deployments that could meaningfully improve latency for multi-turn chat and agentic workloads. The emphasis on empirical system measurements and the claim of seamless integration with conventional PD setups are strengths. However, the overall significance is tempered by the absence of direct comparisons that would confirm the necessity of dynamic routing over simpler static alternatives.
major comments (2)
- [Abstract and motivation/analysis section] The load-bearing claim that 'no single fixed routing strategy satisfies all SLOs simultaneously' (Abstract and motivation section) is not supported by reported comparisons against optimized static policies such as always-local append-prefill routing or fixed token-length thresholds. Without exhaustive evaluation of such baselines across the same SLO targets, workloads, and load regimes used for PPD, the justification for introducing dynamic routing and configurable weights remains unestablished.
- [Evaluation summary (Abstract)] The central empirical result of ~68% Turn 2+ TTFT reduction (Abstract and evaluation summary) is presented without sufficient detail on the precise baselines, workload definitions, number of runs, error bars, or potential post-hoc workload selection. These omissions make it difficult to assess whether the reported gains are robust or could be matched by simpler static append-prefill policies.
minor comments (1)
- [Abstract] The abstract refers to 'comprehensive analysis' demonstrating that no fixed strategy works, yet provides no section pointer or concise summary of that analysis, hindering quick location of the supporting evidence.
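The static-baseline comparison requested in the major comments could be organized as an SLO attainment check per policy. The sketch below is hypothetical throughout: the data layout, the nearest-rank percentile, the default 90th percentile, and the names `percentile` and `meets_all_slos` are assumptions, and no numbers from the paper are used.

```python
from math import ceil


def percentile(samples, p):
    """Nearest-rank percentile for p in (0, 100]."""
    ordered = sorted(samples)
    rank = max(1, ceil(p * len(ordered) / 100))
    return ordered[rank - 1]


def meets_all_slos(per_regime, slos, p=90):
    """Check one policy against every load regime's TTFT and TPOT limits.

    per_regime: {regime: {"ttft_ms": [...], "tpot_ms": [...]}}
    slos:       {regime: {"ttft_ms": limit, "tpot_ms": limit}}
    """
    for regime, limits in slos.items():
        measured = per_regime[regime]
        if percentile(measured["ttft_ms"], p) > limits["ttft_ms"]:
            return False
        if percentile(measured["tpot_ms"], p) > limits["tpot_ms"]:
            return False
    return True
```

Running this for always-local routing, fixed token-length thresholds, and PPD over the same regimes would show directly whether any static policy passes everywhere.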
Simulated Author's Rebuttal
We thank the referee for the constructive feedback on our manuscript. We address each major comment below and commit to revisions that strengthen the paper without misrepresenting our existing results.
- Referee: [Abstract and motivation/analysis section] The load-bearing claim that 'no single fixed routing strategy satisfies all SLOs simultaneously' (Abstract and motivation section) is not supported by reported comparisons against optimized static policies such as always-local append-prefill routing or fixed token-length thresholds. Without exhaustive evaluation of such baselines across the same SLO targets, workloads, and load regimes used for PPD, the justification for introducing dynamic routing and configurable weights remains unestablished.
  Authors: We appreciate the referee pointing out the need for stronger empirical grounding of this claim. The motivation section presents analysis showing that static policies (including always-local append-prefill and fixed token-length thresholds) fail to simultaneously meet all SLO targets under varying load and workload conditions, which motivated the dynamic approach. However, we acknowledge that more exhaustive side-by-side comparisons against these optimized static policies across the full range of SLO targets, workloads, and load regimes would better substantiate the necessity of dynamic routing. We will add these comparisons in the revised manuscript.
  Revision: yes
- Referee: [Evaluation summary (Abstract)] The central empirical result of ~68% Turn 2+ TTFT reduction (Abstract and evaluation summary) is presented without sufficient detail on the precise baselines, workload definitions, number of runs, error bars, or potential post-hoc workload selection. These omissions make it difficult to assess whether the reported gains are robust or could be matched by simpler static append-prefill policies.
  Authors: We agree that greater transparency on the experimental setup is required. The reported ~68% TTFT reduction is measured against standard PD disaggregation baselines under the multi-turn workloads described in the evaluation section. In the revision we will expand the abstract and evaluation sections to explicitly detail the precise baselines, workload definitions (including turn distributions and request patterns), number of runs, error bars, and the workload selection methodology to confirm the results are robust and not matched by simpler static append-prefill policies.
  Revision: yes
Circularity Check
No circularity: claims rest on empirical analysis and evaluation
The paper presents an empirical systems proposal for PPD disaggregation. Its central claims derive from observed performance differences between full prefill and append-prefill (stated as an order-of-magnitude slowdown difference) and from analysis showing fixed routing limitations, followed by direct measurement of TTFT/TPOT improvements under load. No equations, fitted parameters, or self-citation chains are used to derive the reported reductions; the 68% TTFT figure is an evaluation outcome, not a constructed prediction. The motivation for dynamic routing is presented as an analysis result rather than a self-definitional or fitted-input reduction. The derivation chain is therefore self-contained against external benchmarks and does not reduce to its inputs by construction.
Axiom & Free-Parameter Ledger
free parameters (1)
- configurable weights for SLO adaptation
axioms (1)
- Domain assumption: append-prefill incurs an order-of-magnitude smaller decoding slowdown than full prefill.