arxiv: 2603.13358 · v2 · submitted 2026-03-09 · 💻 cs.NI

Not All Prefills Are Equal: PPD Disaggregation for Multi-turn LLM Serving

Zongze Li, Jingyu Liu, Zhen Xu, Yineng Zhang, Tahseen Rabbani, Ce Zhang

Pith reviewed 2026-05-15 14:14 UTC · model grok-4.3

classification 💻 cs.NI

keywords multi-turn LLM serving · prefill-decode disaggregation · KV cache · TTFT · TPOT · dynamic routing · append-prefill · service level objectives

The pith

Dynamic local routing of append-prefills on decode nodes reduces multi-turn TTFT by about 68 percent.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

In multi-turn LLM conversations every new turn needs the latest prompt prefilled while the prior context sits in cache. The paper finds that these append-prefills interfere with decoding far less than full prefills of fresh conversations. Routing the lighter append-prefills to decode nodes avoids repeated KV-cache transfers across the network. A dynamic router chooses when to keep the work local, guided by adjustable weights that match different service-level targets. Evaluations show the change delivers substantially faster first tokens on later turns while output speed per token stays competitive.

Core claim

Prefill Prefill-capable Decode (PPD) disaggregation is a dynamic routing system that decides when to process Turn 2+ requests locally on decode nodes using cached KV states. It adapts to varying SLOs via configurable weights and integrates with existing PD deployments. The result is a reduction of Turn 2+ TTFT by approximately 68 percent while keeping TPOT competitive and easing KV-transfer congestion under load.

What carries the argument

PPD disaggregation, a dynamic routing mechanism that routes append-prefill operations locally to decode nodes to reuse cached KV states and avoid cross-node transfers, controlled by configurable weights for SLO adaptation.
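
The routing decision described above can be sketched as a weighted comparison between an estimated TTFT benefit and an estimated TPOT penalty. This is a minimal illustration under stated assumptions, not the paper's implementation: the names `RouterWeights`, `Request`, `DecodeNodeState`, and `route`, the linear cost model, and the default coefficients are all hypothetical.

```python
from dataclasses import dataclass


@dataclass
class RouterWeights:
    """Hypothetical SLO knobs: relative importance of TTFT versus TPOT."""
    ttft: float = 1.0
    tpot: float = 1.0


@dataclass
class Request:
    turn: int            # 1 for a fresh conversation, 2+ for follow-up turns
    new_tokens: int      # tokens appended in this turn
    cached_tokens: int   # prior context whose KV states may already be cached


@dataclass
class DecodeNodeState:
    active_decodes: int  # sequences currently decoding on the node
    has_cache: bool      # whether this conversation's KV cache is resident


def route(req: Request, node: DecodeNodeState, w: RouterWeights,
          transfer_cost_per_token: float = 1.0,
          interference_per_token: float = 0.05) -> str:
    """Return "decode" to run the append-prefill locally, else "prefill".

    Both cost coefficients are made-up placeholders; a real system would
    calibrate them from measurements.
    """
    # First turns, or turns whose cache is not resident, need a full prefill.
    if req.turn < 2 or not node.has_cache:
        return "prefill"
    # Assumed benefit: the KV states for the whole context stay on the node.
    ttft_gain = transfer_cost_per_token * (req.cached_tokens + req.new_tokens)
    # Assumed penalty: local prefill work slows every active decode.
    tpot_penalty = interference_per_token * req.new_tokens * node.active_decodes
    return "decode" if w.ttft * ttft_gain >= w.tpot * tpot_penalty else "prefill"
```

With the default weights, a short follow-up turn on a node that holds the cache is kept local; raising `RouterWeights.tpot` makes the same request fall back to the prefill node, which is the kind of SLO-dependent behavior the configurable weights are meant to express.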

Load-bearing premise

Append-prefill incurs an order-of-magnitude smaller decoding slowdown than full prefill, and no single fixed routing strategy can meet all service-level objectives at once.

What would settle it

A direct measurement showing that append-prefill causes decoding slowdown comparable to full prefill, or a static routing policy that simultaneously meets all tested SLOs without dynamic weight adjustment.

Original abstract

Prefill-Decode (PD) disaggregation has become the standard architecture for modern LLM inference engines, which alleviates the interference of two distinctive workloads. With the growing demand for multi-turn interactions in chatbots and agentic systems, we re-examined PD in this case and found two fundamental inefficiencies: (1) every turn requires prefilling the new prompt and response from the last turn, and (2) repeated KV transfers between prefill and decode nodes saturate the bandwidth, leading to high latency and even service degradation. Our key insight is that not all prefill operations are equally disruptive: append-prefill, which processes only the new input tokens while reusing cached KV states, incurs an order-of-magnitude smaller decoding slowdown than full prefill. This motivates routing append-prefill to decode nodes locally. However, through comprehensive analysis, we show that no single fixed routing strategy satisfies all Service Level Objectives (SLOs) simultaneously. Based on this insight, we propose Prefill Prefill-capable Decode (PPD) disaggregation, a dynamic routing system that decides when to process Turn 2+ requests locally on decode nodes using cached KV states. PPD adapts to varying SLOs via configurable weights and seamlessly integrates with traditional PD deployments. With extensive evaluations, we show that PPD reduces Turn 2+ time-to-first-token (TTFT) by ~68% while maintaining competitive time-per-output-token (TPOT), effectively alleviating KV transfer congestion under high load. PPD provides a flexible and efficient paradigm for multi-turn LLM serving.
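
To see why repeated KV transfers can strain the network, a back-of-envelope estimate helps. The sketch below uses the standard KV-cache size formula (keys plus values, per layer, per KV head). The model shape and link speed in the usage note are hypothetical and are not taken from the paper, and the function names are illustrative.

```python
def kv_bytes_per_token(num_layers: int, num_kv_heads: int, head_dim: int,
                       bytes_per_elem: int = 2) -> int:
    """KV-cache bytes stored per token: one key and one value vector
    per layer and per KV head (bytes_per_elem=2 corresponds to fp16/bf16)."""
    return 2 * num_layers * num_kv_heads * head_dim * bytes_per_elem


def transfer_seconds(context_tokens: int, bytes_per_token: int,
                     link_gbps: float) -> float:
    """Ideal time to move a context's KV cache over a link of the given
    speed in gigabits per second, ignoring protocol overhead and contention."""
    link_bytes_per_second = link_gbps * 1e9 / 8
    return context_tokens * bytes_per_token / link_bytes_per_second
```

For an assumed model with 32 layers, 8 KV heads, and a head dimension of 128 in fp16, `kv_bytes_per_token(32, 8, 128)` gives 131072 bytes (128 KiB) per token, so a 4096-token context is 512 MiB. Over an uncontended 100 Gbit/s link that is roughly 43 ms per transfer, and the cost recurs on every turn that moves the cache between nodes.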

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note private letter to a colleague

PPD disaggregation cuts multi-turn TTFT by routing append-prefills locally on decode nodes, but the push for dynamic weights over a good fixed policy lacks a direct head-to-head.

The core contribution is recognizing that append-prefill (new tokens only, reusing KV cache) creates far less decode interference than a full prefill, so routing those requests locally on decode nodes avoids repeated KV transfers. They report this drops Turn 2+ TTFT by roughly 68% while keeping TPOT competitive under high load. That matches real chatbot and agent workloads where most turns are short continuations rather than brand-new long prompts. The system integrates with existing PD setups and adds configurable weights to adapt to different SLO targets, which is a reasonable engineering move when one static rule won't cover every latency or throughput goal at once. Evaluations appear to cover varying loads and show the KV congestion relief, which is the practical payoff. The paper is aimed at systems people already running disaggregated inference who need to handle chatty multi-turn traffic without adding more hardware. It deserves a serious referee because the baseline inefficiency is real and the proposed fix is concrete and measurable. The main gap is that the claim no fixed routing strategy works for all SLOs rests on analysis rather than showing an optimized static policy (say, always-local append-prefill or a simple length threshold) fails to match PPD performance across the same workloads. If that comparison is missing or weak, the dynamic machinery could be overkill. Still, the numbers and the insight on append vs full prefill stand on their own even if the dynamic part needs tightening.

Referee Report

2 major / 1 minor

Summary. The manuscript proposes PPD (Prefill Prefill-capable Decode) disaggregation for multi-turn LLM serving. Building on standard Prefill-Decode (PD) architectures, it identifies two inefficiencies: repeated full prefill of new prompts each turn and KV-cache transfer congestion between nodes. The core insight is that append-prefill (processing only new tokens while reusing cached KV states) incurs far less decoding slowdown than full prefill. This motivates routing append-prefill locally on decode nodes. The paper argues that no single fixed routing policy meets all SLOs simultaneously, and therefore introduces dynamic routing with configurable weights that adapts to varying SLO targets. Extensive evaluations are reported to show a ~68% reduction in Turn 2+ TTFT while preserving competitive TPOT and relieving KV-transfer pressure under high load.

Significance. If the performance claims are substantiated with proper fixed-policy baselines, the work offers a pragmatic extension to existing PD deployments that could meaningfully improve latency for multi-turn chat and agentic workloads. The emphasis on empirical system measurements and the claim of seamless integration with conventional PD setups are strengths. However, the overall significance is tempered by the absence of direct comparisons that would confirm the necessity of dynamic routing over simpler static alternatives.

major comments (2)

[Abstract and motivation/analysis section] The load-bearing claim that 'no single fixed routing strategy satisfies all SLOs simultaneously' (Abstract and motivation section) is not supported by reported comparisons against optimized static policies such as always-local append-prefill routing or fixed token-length thresholds. Without exhaustive evaluation of such baselines across the same SLO targets, workloads, and load regimes used for PPD, the justification for introducing dynamic routing and configurable weights remains unestablished.
[Evaluation summary (Abstract)] The central empirical result of ~68% Turn 2+ TTFT reduction (Abstract and evaluation summary) is presented without sufficient detail on the precise baselines, workload definitions, number of runs, error bars, or potential post-hoc workload selection. These omissions make it difficult to assess whether the reported gains are robust or could be matched by simpler static append-prefill policies.
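
The static baselines requested in the major comments could be expressed as follows. This is a sketch under assumptions: the policy functions, the `max_local_tokens` threshold, and the `slo_attainment` helper are hypothetical names introduced for illustration, and the latency samples would have to come from real measurements.

```python
from typing import Iterable, Tuple


def always_remote(turn: int, new_tokens: int) -> str:
    """Conventional PD: every prefill goes to a prefill node."""
    return "prefill"


def always_local(turn: int, new_tokens: int) -> str:
    """Static baseline: every Turn 2+ append-prefill stays on the decode node."""
    return "decode" if turn >= 2 else "prefill"


def length_threshold(turn: int, new_tokens: int,
                     max_local_tokens: int = 256) -> str:
    """Static baseline: keep only short append-prefills local.
    The 256-token default is an arbitrary placeholder."""
    if turn >= 2 and new_tokens <= max_local_tokens:
        return "decode"
    return "prefill"


def slo_attainment(samples: Iterable[Tuple[float, float]],
                   ttft_slo: float, tpot_slo: float) -> float:
    """Fraction of requests meeting both SLOs, given measured
    (ttft_seconds, tpot_seconds) pairs for one policy and workload."""
    samples = list(samples)
    if not samples:
        return 0.0
    met = sum(1 for ttft, tpot in samples
              if ttft <= ttft_slo and tpot <= tpot_slo)
    return met / len(samples)
```

Running each policy on the same workloads and load levels used for PPD, then comparing `slo_attainment` per SLO target, would give the head-to-head evidence the comments ask for.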

minor comments (1)

[Abstract] The abstract refers to 'comprehensive analysis' demonstrating that no fixed strategy works, yet provides no section pointer or concise summary of that analysis, hindering quick location of the supporting evidence.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback on our manuscript. We address each major comment below and commit to revisions that strengthen the paper without misrepresenting our existing results.

Referee: [Abstract and motivation/analysis section] The load-bearing claim that 'no single fixed routing strategy satisfies all SLOs simultaneously' (Abstract and motivation section) is not supported by reported comparisons against optimized static policies such as always-local append-prefill routing or fixed token-length thresholds. Without exhaustive evaluation of such baselines across the same SLO targets, workloads, and load regimes used for PPD, the justification for introducing dynamic routing and configurable weights remains unestablished.

Authors: We appreciate the referee pointing out the need for stronger empirical grounding of this claim. The motivation section presents analysis showing that static policies (including always-local append-prefill and fixed token-length thresholds) fail to simultaneously meet all SLO targets under varying load and workload conditions, which motivated the dynamic approach. However, we acknowledge that more exhaustive side-by-side comparisons against these optimized static policies across the full range of SLO targets, workloads, and load regimes would better substantiate the necessity of dynamic routing. We will add these comparisons in the revised manuscript. revision: yes
Referee: [Evaluation summary (Abstract)] The central empirical result of ~68% Turn 2+ TTFT reduction (Abstract and evaluation summary) is presented without sufficient detail on the precise baselines, workload definitions, number of runs, error bars, or potential post-hoc workload selection. These omissions make it difficult to assess whether the reported gains are robust or could be matched by simpler static append-prefill policies.

Authors: We agree that greater transparency on the experimental setup is required. The reported ~68% TTFT reduction is measured against standard PD disaggregation baselines under the multi-turn workloads described in the evaluation section. In the revision we will expand the abstract and evaluation sections to explicitly detail the precise baselines, workload definitions (including turn distributions and request patterns), number of runs, error bars, and the workload selection methodology to confirm the results are robust and not matched by simpler static append-prefill policies. revision: yes

Circularity Check

0 steps flagged

No circularity: claims rest on empirical analysis and evaluation

The paper presents an empirical systems proposal for PPD disaggregation. Its central claims derive from observed performance differences between full prefill and append-prefill (stated as an order-of-magnitude slowdown difference) and from analysis showing fixed routing limitations, followed by direct measurement of TTFT/TPOT improvements under load. No equations, fitted parameters, or self-citation chains are used to derive the reported reductions; the 68% TTFT figure is an evaluation outcome, not a constructed prediction. The motivation for dynamic routing is presented as an analysis result rather than a self-definitional or fitted-input reduction. The derivation chain is therefore self-contained against external benchmarks and does not reduce to its inputs by construction.

Axiom & Free-Parameter Ledger

1 free parameter · 1 axiom · 0 invented entities

The central claim rests on the domain assumption that append-prefill costs are dramatically lower and that dynamic routing is necessary; no free parameters are explicitly fitted in the abstract description.

free parameters (1)

configurable weights for SLO adaptation
Weights used to balance routing decisions under varying service level objectives; chosen to adapt the system.

axioms (1)

domain assumption: Append-prefill incurs an order-of-magnitude smaller decoding slowdown than full prefill.
Key insight from comprehensive analysis invoked to justify local routing on decode nodes.

pith-pipeline@v0.9.0 · 5593 in / 1382 out tokens · 76808 ms · 2026-05-15T14:14:34.913025+00:00 · methodology
