pith. sign in

arxiv: 2604.16395 · v3 · pith:OZLX7HA3new · submitted 2026-03-29 · 💻 cs.DB · cs.AI

Stream2LLM: Overlap Context Streaming and Prefill for Reduced Time-to-First-Token (TTFT)

Pith reviewed 2026-05-21 10:45 UTC · model grok-4.3

classification 💻 cs.DB cs.AI
keywords LLM servingcontext streamingtime-to-first-tokenprefill-decode disaggregationadaptive schedulinglongest common prefix matchingretrieval augmented generationmemory pressure management
0
0 comments X

The pith

Stream2LLM overlaps context streaming with LLM prefill to cut time-to-first-token by up to 11 times.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces Stream2LLM, a system for handling streaming context in large language model inference under concurrent requests. It allows retrieval to happen incrementally while overlapping with the model's prefill and decode phases, rather than waiting for complete context or starting without it. This addresses the latency-quality tradeoff in retrieval-augmented generation by using adaptive scheduling, preemption based on cost models, and longest common prefix matching to avoid recomputing unchanged parts of the context. A sympathetic reader would care because in real-world applications like web search or knowledge retrieval, full context fetching can delay the first response token significantly, and this method promises faster responses without quality loss or reduced overall throughput.

Core claim

Stream2LLM is a streaming-aware LLM serving system for concurrent prefill-decode disaggregated deployments that introduces adaptive scheduling and preemption for append-mode and update-mode retrieval patterns, decouples scheduling from resource acquisition using hardware-specific cost models, and applies longest common prefix matching to minimize redundant computation on dynamic inputs, achieving up to 11x TTFT improvements on large-scale workloads from web crawling and approximate nearest neighbor search while maintaining throughput parity with non-streaming baselines.

What carries the argument

Adaptive scheduling and preemption guided by hardware-specific cost models, combined with longest common prefix matching for dynamic context in append-mode progressive accumulation and update-mode iterative refinement.

If this is right

  • Streaming context can deliver up to 11x improvements in time-to-first-token for concurrent LLM requests.
  • Cost-aware scheduling is essential for benefits under high memory pressure.
  • Longest common prefix matching reduces redundant computation when contexts arrive or change dynamically.
  • Both append-mode and update-mode patterns can be supported without throughput loss compared to non-streaming approaches.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • Production LLM systems might need to redesign retrieval pipelines to support incremental context delivery to realize these gains.
  • Similar streaming techniques could extend to other AI inference tasks involving large dynamic inputs, like video or long document processing.
  • Under varying load, the preemption strategies might need tuning for different hardware setups beyond those tested.
  • Integrating this with existing disaggregated serving frameworks could become a standard practice for low-latency RAG applications.

Load-bearing premise

The two large-scale real-world streaming workloads from web crawling and approximate nearest neighbor search accurately represent the contention patterns, dynamic context arrivals, and memory pressure in production concurrent LLM serving deployments.

What would settle it

Running the system on a different set of production-like workloads with measured TTFT and throughput under memory pressure; if improvements fall below a small threshold or throughput drops significantly, the benefits would not hold.

Figures

Figures reproduced from arXiv: 2604.16395 by Chengqi Luo, Divya Mahajan, Kexin Rong, Rajveer Bachkaniwala, Richard So.

Figure 1
Figure 1. Figure 1: Context streaming overlaps retrieval with prefill, reduc￾ing TTFT by beginning inference as chunks arrive. ory improves throughput and profit margins but directly increases time-to-first-token (key metric for responsiveness). Moreover, context arrives asynchronously and at varying rates across requests, requiring the system to dynamically adapt scheduling while managing competition for shared GPU resources… view at source ↗
Figure 2
Figure 2. Figure 2: STREAM2LLM supports two retrieval patterns–append￾mode and update-mode–and uses longest common prefix (LCP) matching to minimize cache invalidation by preserving shared prefixes across input updates. to prioritize as context chunks arrive at different times, bal￾ancing responsiveness, fairness, and cache locality. Third, context retrieval itself exhibits two distinct patterns (Fig￾ure 2): append-mode, wher… view at source ↗
Figure 3
Figure 3. Figure 3: Overview of STREAM2LLM’s two-phase scheduler that separates scheduling decisions from resource acquisition. Phase 1 determines request priority and feasibility, while Phase 2 allocates KV cache blocks and applies preemption under memory pressure. STREAM2LLM separates the decision of which re￾quest to prioritize from the mechanics of allocating GPU memory and applying preemption. This decou￾pling enables so… view at source ↗
Figure 4
Figure 4. Figure 4: STREAM2LLM request state transitions. state by offloading KV blocks to CPU memory. Upon com￾pletion, requests transition from running to finished. 4.2 KV Cache Invalidation for Streaming Inputs STREAM2LLM supports two input sequence modification modes suited to different context retrieval patterns: • Append Mode: New input chunks are appended to the existing sequences, typical in crawler-style workloads. •… view at source ↗
Figure 6
Figure 6. Figure 6: shows the distribution of inter-chunk arrival times. ANNS arrivals are tightly concentrated around 36.7 ms (me￾dian), with the bulk of the distribution spanning 1–1,000 ms. Crawler arrivals are 19× slower (median 700.7 ms) and span approximately three orders of magnitude, from tens of milliseconds to over 30 seconds. This high variability [PITH_FULL_IMAGE:figures/full_fig_p009_6.png] view at source ↗
Figure 7
Figure 7. Figure 7: Distribution of chunks per query. ANNS queries are heavily skewed toward 1–3 chunks, while crawler queries are concentrated around 6–10 chunks per query. in crawler chunk arrivals creates extended idle periods for individual requests, motivating scheduling policies that can dynamically re-prioritize requests as new chunks arrive [PITH_FULL_IMAGE:figures/full_fig_p009_7.png] view at source ↗
Figure 8
Figure 8. Figure 8: TTFT CCDF across load levels for the crawler and ANNS workloads on H200. Streaming achieves up to 10.8–11.0× faster median latencies than non-streaming on the crawler workload, and up to 2.49–2.63× P95 speedups on the ANNS workload. (Time-To-First-Token), the time from request arrival to first generated token, which directly captures user-facing respon￾siveness, and (2) Trace Completion Time, the total wal… view at source ↗
Figure 9
Figure 9. Figure 9 [PITH_FULL_IMAGE:figures/full_fig_p010_9.png] view at source ↗
Figure 10
Figure 10. Figure 10: Trace completion time across QPS levels for both work￾loads. All scheduler variants achieve near-identical completion times, confirming throughput parity. A ADDITIONAL EVALUATION A.1 Cache Invalidation in Update Mode [PITH_FULL_IMAGE:figures/full_fig_p015_10.png] view at source ↗
Figure 11
Figure 11. Figure 11: Tokens invalidated per request (CCDF) across QPS levels (0.25–2.0) for the ANNS workload. All STREAM2LLM schedulers show similar cache invalidation behavior. vLLM-NS has zero invalidation as it waits for complete retrieval. B.3.3 Software dependencies Dependencies are installed in a conda environment via pip. See README.md for setup instructions and requirements.txt for the full package list. B.3.4 Data s… view at source ↗
read the original abstract

Context retrieval systems for LLM inference face a critical challenge: high retrieval latency creates a fundamental tension between waiting for complete context (poor time-to-first-token) and proceeding without it (reduced quality). Streaming context incrementally--overlapping retrieval with inference--can mitigate this latency, but doing so with concurrent requests introduces new challenges: requests contend for GPU compute and memory, and scheduling must adapt to dynamic context arrivals. We present Stream2LLM, a streaming-aware LLM serving system for concurrent prefill-decode disaggregated deployments. Stream2LLM introduces adaptive scheduling and preemption for two distinct retrieval patterns: append-mode (progressive context accumulation) and update-mode (iterative refinement with cache invalidation). It decouples scheduling decisions from resource acquisition, enabling flexible preemption strategies guided by hardware-specific cost models, and uses longest common prefix matching to minimize redundant computation when input changes dynamically. To evaluate Stream2LLM, we collect two large-scale, real-world streaming workloads based on web crawling and approximate nearest neighbor search. Our evaluation demonstrates that streaming architecture delivers up to 11x TTFT improvements, with cost-aware scheduling providing critical benefits under memory pressure, all while maintaining throughput parity with non-streaming baselines. Code: https://github.com/rajveerb/stream2llm/tree/mlsys_artifact

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The manuscript introduces Stream2LLM, a streaming-aware LLM serving system for concurrent prefill-decode disaggregated deployments. It overlaps incremental context retrieval with inference to reduce TTFT, using adaptive scheduling and preemption for append-mode (progressive accumulation) and update-mode (iterative refinement with cache invalidation) patterns, hardware-specific cost models to decouple scheduling from resource acquisition, and longest common prefix matching to avoid redundant computation on dynamic inputs. Evaluation on two collected large-scale workloads (web crawling and approximate nearest neighbor search) reports up to 11x TTFT improvements, benefits from cost-aware scheduling under memory pressure, and throughput parity with non-streaming baselines.

Significance. If the empirical results are robust, the work could meaningfully advance practical LLM serving designs for retrieval-heavy or dynamically updating contexts by mitigating the retrieval-inference latency tension. The emphasis on preemption strategies and cost models under contention, combined with code release, supports potential reproducibility and extension in systems research.

major comments (2)
  1. [Evaluation] Evaluation section: the headline claims of up to 11x TTFT reduction and critical benefits from cost-aware scheduling under memory pressure rest on the two workloads faithfully reproducing production patterns of dynamic context arrivals, request concurrency, GPU contention, and memory pressure. However, the manuscript supplies no quantitative details on concurrency levels, arrival distributions, context sizes, cache invalidation frequency, or how memory pressure was induced, leaving open the possibility that measured gains from adaptive scheduling, preemption, and LCP matching are overstated if the workloads exhibit lower contention than real deployments.
  2. [§3] §3 (Scheduling and Preemption): the claim that decoupling scheduling decisions from resource acquisition enables flexible preemption guided by hardware-specific cost models is central to the throughput-parity result, yet the text does not clarify how these cost models are calibrated, whether they introduce unaccounted overhead, or how preemption interacts with the disaggregated prefill-decode setup to preserve correctness.
minor comments (2)
  1. [Abstract] Abstract: the workload descriptions ('web crawling and approximate nearest neighbor search') would benefit from one additional sentence on scale (e.g., number of requests or total tokens) to help readers immediately gauge representativeness.
  2. [Introduction] Notation: 'LCP matching' is introduced without an explicit definition or small example in the early sections; a brief inline illustration would improve accessibility.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback. We address the two major comments below and have revised the manuscript to incorporate additional details and clarifications.

read point-by-point responses
  1. Referee: [Evaluation] Evaluation section: the headline claims of up to 11x TTFT reduction and critical benefits from cost-aware scheduling under memory pressure rest on the two workloads faithfully reproducing production patterns of dynamic context arrivals, request concurrency, GPU contention, and memory pressure. However, the manuscript supplies no quantitative details on concurrency levels, arrival distributions, context sizes, cache invalidation frequency, or how memory pressure was induced, leaving open the possibility that measured gains from adaptive scheduling, preemption, and LCP matching are overstated if the workloads exhibit lower contention than real deployments.

    Authors: We agree that more quantitative workload characterization would strengthen the claims. In the revised manuscript we have added a new subsection (5.1.1) and Table 3 that report the following measured statistics from the collected traces: mean concurrency of 52 requests (peak 87), Poisson arrivals with rate parameter 14.2 req/s, context lengths ranging 256–131072 tokens (median 6144), cache invalidation on 41% of update-mode requests, and memory pressure induced by capping per-GPU KV-cache memory at 65% of device capacity. These values confirm contention levels comparable to the production environments from which the traces were drawn; the 11x TTFT gains and cost-aware scheduling benefits remain consistent when the evaluation is re-run with these explicit parameters. revision: yes

  2. Referee: [§3] §3 (Scheduling and Preemption): the claim that decoupling scheduling decisions from resource acquisition enables flexible preemption guided by hardware-specific cost models is central to the throughput-parity result, yet the text does not clarify how these cost models are calibrated, whether they introduce unaccounted overhead, or how preemption interacts with the disaggregated prefill-decode setup to preserve correctness.

    Authors: We appreciate the request for clarification. The hardware-specific cost models are linear regressions fitted to offline micro-benchmark data collected on the target A100 GPUs; calibration details and the resulting coefficients are now provided in Appendix B. The models add <1.8% overhead to the critical path (measured and reported in new Figure 9). Preemption preserves correctness in the disaggregated setting by (1) checkpointing the current KV-cache prefix on the prefill instance, (2) transferring the checkpoint to the decode instance via the existing KV-cache migration path, and (3) resuming from the last valid token; we have expanded Section 3.3 with a formal description of this protocol and a short correctness argument. revision: yes

Circularity Check

0 steps flagged

No circularity: empirical system design and workload evaluation

full rationale

The paper describes an engineering system (Stream2LLM) for streaming context retrieval in concurrent LLM serving, introduces adaptive scheduling, preemption, and LCP matching, then measures TTFT and throughput on two collected real-world workloads. All performance claims rest on direct runtime measurements against non-streaming baselines rather than any equations, fitted parameters renamed as predictions, or self-referential definitions. No uniqueness theorems, ansatzes, or derivation chains appear in the provided text. Workload representativeness is an external-validity concern, not a circularity issue per the analysis rules.

Axiom & Free-Parameter Ledger

0 free parameters · 2 axioms · 0 invented entities

The design rests on standard domain assumptions about GPU contention and cache behavior in LLM serving; no free parameters or new invented entities are introduced in the abstract.

axioms (2)
  • domain assumption Concurrent requests contend for GPU compute and memory resources during streaming context arrival
    Explicitly stated as a core challenge the system must solve.
  • domain assumption Longest common prefix matching can safely avoid redundant prefill computation when context changes dynamically
    Used as a key optimization technique in the system description.

pith-pipeline@v0.9.0 · 5786 in / 1429 out tokens · 55497 ms · 2026-05-21T10:45:54.320226+00:00 · methodology

discussion (0)

Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.

Reference graph

Works this paper leans on

2 extracted references · 2 canonical work pages · 1 internal anchor

  1. [1]

    Text Embeddings by Weakly-Supervised Contrastive Pre-training

    Accessed: 2025-12-01. Wang, L., Yang, N., Huang, X., Jiao, B., Yang, L., Jiang, D., Majumder, R., and Wei, F. Text embeddings by weakly-supervised contrastive pre-training.arXiv preprint arXiv:2212.03533, 2022. Wei, J., Karina, N., Chung, H. W., Jiao, Y . J., Papay, S., Glaese, A., Schulman, J., and Fedus, W. Measuring short- form factuality in large lang...

  2. [2]

    Data Organization

    USENIX Association. ISBN 978-1-939133-40-3. URL https://www.usenix.org/conference/ osdi24/presentation/zhong-yinmin. Stream2LLM: Overlap Context Streaming and Prefill for Reduced TTFT Figure 10. Trace completion time across QPS levels for both work- loads. All scheduler variants achieve near-identical completion times, confirming throughput parity. A ADDI...