Fail-Closed Lowering of Resident KV Claims onto LLM Serving Runtimes

Lukas Stepanek

arxiv: 2606.01387 · v1 · pith:FBLBR2HKnew · submitted 2026-05-31 · 💻 cs.DC

Pith reviewed 2026-06-28 16:17 UTC · model grok-4.3

keywords: resident kv claims, fail-closed lowering, llm serving runtimes, kv cache, future reuse, lowering relation, conformance checker

The pith

A conformant lowering of resident KV claims requires binding to claim identity, materialization predicate, ordered lifecycle events, and claim-scoped outcomes.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

LLM serving runtimes provide KV-cache primitives such as retention priority and offload that appear to support future reuse but do not necessarily accept the obligation. The paper defines when a primitive or adapter can be treated as satisfying an accepted ResidentClaim about future KV reuse. Conformant lowering demands explicit bindings to the claim's identity, a predicate for materialization, ordered lifecycle events, and outcomes scoped to the claim. It introduces a fail-closed lowering relation and a checker to classify mappings from runtimes to these obligations as native, evidence-based, or rejected. The work shows that while some systems offer substrates, only a specific patched vLLM achieves the binding through real offload behavior leading to fail-closed outcomes.

Core claim

The central discovery is the fail-closed lowering relation for ResidentClaim onto LLM serving runtimes. A runtime primitive satisfies an accepted future-KV obligation only when its behavior is bound to the claim identity, a materialization predicate, ordered lifecycle events, and claim-scoped outcomes. The relation classifies mappings as native conformance, adapter evidence, or rejected, and the checker validates descriptors against obligation bundles without proving unaudited behavior. A positive example is a local patched vLLM where claim metadata flows through offload/load and restoration failure hits the invalid-KV-load path as an ordered claim-scoped outcome.
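The three-way classification in this summary can be sketched as a small check. This is a minimal illustration, not the paper's checker: the `Descriptor` type, the `classify` function, and the string names of the four bindings are assumptions introduced here, and the real descriptor format is not reproduced on this page.

```python
from dataclasses import dataclass
from enum import Enum


class Verdict(Enum):
    NATIVE = "native_conformance"
    ADAPTER = "adapter_evidence"
    REJECTED = "rejected"


# The four bindings named in the core claim (identifier spellings are assumed).
BINDINGS = (
    "claim_identity",
    "materialization_predicate",
    "ordered_lifecycle_events",
    "claim_scoped_outcomes",
)


@dataclass(frozen=True)
class Descriptor:
    """Hypothetical descriptor: which bindings each layer supplies."""
    native: frozenset = frozenset()
    adapter: frozenset = frozenset()


def classify(d: Descriptor) -> Verdict:
    required = set(BINDINGS)
    if required <= d.native:
        # The runtime itself supplies every binding.
        return Verdict.NATIVE
    if required <= (d.native | d.adapter):
        # Bindings are complete only with a trusted adapter or patch.
        return Verdict.ADAPTER
    # Fail closed: any missing binding rejects the mapping.
    return Verdict.REJECTED
```

Under these assumptions, a runtime that exposes offload or priority without any of the bindings is `Descriptor()` and classifies as `REJECTED`, which mirrors the summary's point that a primitive alone does not accept the obligation.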

What carries the argument

The fail-closed lowering relation that enforces binding of runtime behavior to accepted claim identity, materialization predicate, ordered lifecycle events, and claim-scoped outcomes for satisfying future KV reuse obligations.

If this is right

Primitives like priority and offload without the bindings do not accept responsibility for future KV reuse.
The checker can classify runtime/mode mappings into native, adapter-observational, policy evidence, approximation, rejected, or unknown.
Public systems like TensorRT-LLM and SGLang expose strong substrates but lack native ResidentClaim conformance.
A patched vLLM connector demonstrates conformance via in-process offload with claim metadata and fail-closed restoration failure.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

This semantics boundary could inform the design of future KV management standards across serving frameworks.
Applying the descriptor format to additional runtimes might reveal more approximation substrates or policy evidence cases.
Extending the fail-closed approach to other resource claims like memory or compute in distributed systems may be possible.

Load-bearing premise

Runtime primitives, trusted adapters, or patches can be treated as satisfying an accepted claim about future KV reuse when the binding conditions of identity, predicate, events, and outcomes are met.

What would settle it

Demonstration of a runtime primitive that satisfies all listed binding conditions yet permits a claimed KV to be lost without triggering a claim-scoped fail-closed outcome would falsify the definition of conformant lowering.

Figures

Figures reproduced from arXiv: 2606.01387 by Lukas Stepanek.

**Figure 2.** Patched vLLM connector/scheduler-boundary witness. The mechanism has three trust boundaries. Native vLLM supplies connector lookup, store/load job creation, worker transfer submission/completion, and failed-load propagation into scheduler invalid-block handling. The patch supplies the accepted claim metadata, lifecycle joins, controlled same-claim failure injection, scheduler-boundary restoration failure te…

Original abstract

LLM serving runtimes increasingly expose KV-cache primitives that resemble future-reuse controls: retention priority, TTL-like duration, host or storage offload, block events, active no-evict scheduling, and KV-aware routing. This paper argues that such primitives are weaker than accepted future-KV obligations. A runtime can expose priority, offload, events, and routing without accepting responsibility for a future reuse claim. We study ResidentClaim lowering: when a runtime primitive, trusted adapter, or patch can be treated as satisfying an accepted claim about future KV reuse. A conformant lowering must bind behavior to accepted claim identity, a materialization predicate, ordered lifecycle events, and claim-scoped outcomes. We contribute a fail-closed lowering relation, checker, descriptor format, and bad-lowering suite that classify runtime/mode mappings as native conformance, adapter-observational evidence, adapter-policy evidence under controlled pressure, approximation substrate, rejected mapping, or unknown evidence. The checker validates manually curated, anchored runtime descriptors against obligation bundles; it does not prove that unaudited runtime behavior is complete. Public TensorRT-LLM, SGLang/HiCache, and Dynamo expose strong substrates and selected adapter positives, but not native ResidentClaim conformance. The positive systems witness is a local patched vLLM connector/scheduler-boundary mechanism: claim metadata flows through real in-process offload/load behavior, and controlled same-claim restoration failure reaches vLLM's invalid-KV-load path and becomes an ordered claim-scoped fail-closed outcome. The result is a calibrated semantics boundary, not a production performance claim or a compatibility survey.
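The abstract's statement that the checker validates anchored descriptors against obligation bundles can be illustrated with a short sketch. The `EvidenceAtom` type, the `lowers` function, and the example obligation names are hypothetical and chosen only for illustration; the rule that an obligation counts only when its evidence is both supported and anchored follows the contribution summary quoted later on this page.

```python
from dataclasses import dataclass, replace
from typing import Iterable, Optional, Set


@dataclass(frozen=True)
class EvidenceAtom:
    """Hypothetical evidence record for a single obligation."""
    obligation: str
    supported: bool
    anchor: Optional[str]  # e.g. a commit-pinned source or doc reference


def satisfied_obligations(atoms: Iterable[EvidenceAtom]) -> Set[str]:
    # An obligation counts only if its evidence is supported AND anchored.
    return {a.obligation for a in atoms if a.supported and a.anchor}


def lowers(bundle: Iterable[str], atoms: Iterable[EvidenceAtom]) -> bool:
    # Fail closed: one missing required obligation rejects the lowering.
    return set(bundle) <= satisfied_obligations(atoms)


# Hypothetical two-obligation bundle and matching evidence.
bundle = ["claim_identity", "claim_scoped_outcome"]
atoms = [
    EvidenceAtom("claim_identity", True, "repo@commit:path"),
    EvidenceAtom("claim_scoped_outcome", True, "repo@commit:path"),
]

# Mutation control: stripping one anchor must flip the result to False.
mutated = [replace(atoms[0], anchor=None), atoms[1]]
```

Here `lowers(bundle, atoms)` holds while `lowers(bundle, mutated)` does not, which is the behaviour a mutation control is meant to confirm. The sketch says nothing about whether the underlying runtime behaviour is complete, matching the abstract's own disclaimer.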

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note (private letter to a colleague)

The paper introduces a classification scheme for KV-cache claim lowering in LLM runtimes but treats its four binding conditions as definitional requirements rather than derived necessities.

The main contribution here is a set of new terms and a checker for ResidentClaim lowering: when a runtime primitive or patch can count as satisfying an accepted future-KV reuse claim. It defines a fail-closed lowering relation and classifies mappings into native conformance, adapter evidence, approximation, rejected, or unknown. The abstract and example show that exposing priority, offload, or events does not by itself accept responsibility for a claim, and the patched vLLM case demonstrates metadata flowing through actual offload/load with failure routed to an ordered, claim-scoped invalid-KV path.

What works is the clear separation between exposed primitives and accepted obligations, plus the practical descriptor format and bad-lowering suite for manual auditing. The vLLM patch supplies one reproducible positive instance.

The soft spot is that the four binding conditions (claim identity, materialization predicate, ordered lifecycle events, claim-scoped outcomes) are stipulated as required for conformance without a model of KV-reuse obligations or counterexamples showing that dropping any one permits non-fail-closed behavior. The checker validates against curated bundles but explicitly does not prove runtime completeness, so the framework rests on acceptance of the definitions.

This is for people working on reliable KV management inside LLM serving systems. A reader already thinking about runtime guarantees or adapters might pick up useful distinctions. It is narrow but the topic is live, so it deserves a serious referee to see whether the full text supplies derivations or more examples that address the stipulation concern.

Referee Report

1 major / 0 minor

Summary. The paper argues that KV-cache primitives in LLM serving runtimes (priority, offload, events, routing) are weaker than accepted future-KV reuse obligations. It defines ResidentClaim lowering and states that a conformant lowering must bind to accepted claim identity, a materialization predicate, ordered lifecycle events, and claim-scoped outcomes. Contributions include a fail-closed lowering relation, an obligation-bundle checker operating on manually curated runtime descriptors, a descriptor format, and a bad-lowering suite that classifies mappings into native conformance, adapter-observational evidence, adapter-policy evidence, approximation substrate, rejected mapping, or unknown. Several systems (TensorRT-LLM, SGLang/HiCache, Dynamo) are classified as exposing strong substrates but lacking native conformance; a patched vLLM connector is presented as a positive witness where claim metadata flows through offload/load and restoration failure reaches an invalid-KV-load path as an ordered claim-scoped outcome. The checker validates against proposed bundles but explicitly disclaims proving runtime completeness.

Significance. If the framework holds, it supplies a calibrated semantics boundary for fail-closed KV-reuse claims in distributed LLM serving, distinguishing native conformance from various evidence levels. The explicit disclaimer on checker scope and the concrete vLLM patch example (showing in-process offload/load behavior and claim-scoped failure) are strengths that make the contribution falsifiable and practically grounded rather than purely abstract.

major comments (1)

[Abstract (definition of conformant lowering)] The claim that a conformant lowering 'must bind behavior to accepted claim identity, a materialization predicate, ordered lifecycle events, and claim-scoped outcomes' is introduced by stipulation rather than derived from a formal semantics of accepted future-KV claims or supported by counterexamples showing that omitting any one binding permits non-fail-closed behavior. This is load-bearing for the fail-closed lowering relation and the subsequent classification scheme.

Simulated Author's Rebuttal

1 response · 0 unresolved

We thank the referee for the detailed review and the recommendation for major revision. We address the single major comment below and agree that the definition requires additional grounding.

Referee: [Abstract (definition of conformant lowering)] The claim that a conformant lowering 'must bind behavior to accepted claim identity, a materialization predicate, ordered lifecycle events, and claim-scoped outcomes' is introduced by stipulation rather than derived from a formal semantics of accepted future-KV claims or supported by counterexamples showing that omitting any one binding permits non-fail-closed behavior. This is load-bearing for the fail-closed lowering relation and the subsequent classification scheme.

Authors: We acknowledge that the definition is presented as a working definition motivated by the semantics of accepted future-KV reuse obligations rather than derived from a fully formal model or accompanied by explicit counterexamples in the current manuscript. Each binding is intended to enforce that violations produce observable, claim-scoped failure rather than silent incorrect reuse. In revision we will add a dedicated subsection (likely in Section 2 or 3) that (1) derives the four bindings from the requirement that an accepted claim must produce fail-closed outcomes under any violation and (2) supplies short counterexamples for each omitted binding (e.g., missing identity permits cross-claim pollution; missing materialization predicate allows stale KV to be treated as valid; unordered events can mask restoration failures; non-scoped outcomes allow side effects outside the claim). This will make the load-bearing relation and classification scheme more rigorously supported while preserving the existing empirical classification results. revision: yes
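Two of the counterexamples named in the rebuttal (missing identity and unordered events) can be made concrete with a toy audit of a claim's event log. This is an editorial sketch only: the event names `accepted`, `offloaded`, `load_requested`, `restored`, and `restore_failed`, the transition table, and the `audit` function are all assumptions and do not come from the paper or from vLLM.

```python
# Hypothetical lifecycle: each event name maps to the events allowed next.
ALLOWED = {
    None: {"accepted"},
    "accepted": {"offloaded"},
    "offloaded": {"load_requested"},
    "load_requested": {"restored", "restore_failed"},
}
TERMINAL = {"restored", "restore_failed"}


def audit(claim_id, events):
    """Audit (claim_id, event_name) pairs given in observed order.

    Returns the terminal outcome for the claim, or a string starting
    with "rejected" when the log cannot be attributed to the claim.
    """
    state = None
    for cid, name in events:
        if cid != claim_id:
            # Identity binding: foreign or unlabeled events never count.
            return "rejected: wrong or missing claim identity"
        if name not in ALLOWED.get(state, set()):
            # Ordering binding: out-of-order events fail closed.
            return f"rejected: unexpected {name!r} after {state!r}"
        state = name
    if state not in TERMINAL:
        # Outcome binding: a log without a terminal outcome proves nothing.
        return "rejected: no claim-scoped outcome"
    return state


log = [
    ("c1", "accepted"),
    ("c1", "offloaded"),
    ("c1", "load_requested"),
    ("c1", "restore_failed"),
]
```

With this table, `audit("c1", log)` yields `restore_failed`, an explicit claim-scoped failure rather than silent reuse, while a log that mixes in another claim's events or skips a step is rejected outright.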

Circularity Check

0 steps flagged

No circularity; definitional framework with no reduction to inputs or self-referential equations

The paper introduces the concept of ResidentClaim lowering and states that a conformant lowering must bind to four specific elements (claim identity, materialization predicate, lifecycle events, claim-scoped outcomes). This is presented as part of the definitional contribution of the fail-closed lowering relation and checker, not as a derived result from a prior model or equations that would reduce by construction. No fitted parameters are renamed as predictions, no self-citations are invoked as load-bearing uniqueness theorems, and no ansatz or renaming of known results occurs. The checker is explicitly limited to validating curated descriptors against the proposed bundles without claiming to prove runtime completeness or necessity of the bindings via counterexamples or formal semantics. The vLLM patch is described as a positive witness satisfying the conditions, not as establishing minimality. The overall contribution is a classification scheme and semantics boundary, which is self-contained against external benchmarks.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Abstract-only review; no information is available on free parameters, axioms, or invented entities.

pith-pipeline@v0.9.1-grok · 5818 in / 1106 out tokens · 34844 ms · 2026-06-28T16:17:14.134650+00:00 · methodology

Reference graph

Works this paper leans on

20 extracted references · 5 canonical work pages · 1 internal anchor

[1]

Fail-Closed Lowering of Resident KV Claims onto LLM Serving Runtimes

Introduction KV-cache reuse has become an explicit systems surface in LLM serving. Production and research runtimes expose token-range retention priorities, duration fields, block stored/removed events, GPU/host/storage cache tiers, load-back paths, active no-evict modes, and KV-aware routing. These mechanisms are real and important. They also create a t...

work page internal anchor Pith review Pith/arXiv arXiv 2026
[2]

Contributions This paper makes four concrete contributions
[3]

It defines an obligation-based lowering model for ResidentClaim modes: best_effort, soft_priority, hard_protected, demotable, expiring, offloadable, and routed_reuse
[4]

The checker counts obligations only when evidence is supported and anchored, requires anchored observed evidence atoms, and keeps adapter depth separate from classification

It implements a fail-closed checker and false-positive suite over machine-readable runtime descriptors. The checker counts obligations only when evidence is supported and anchored, requires anchored observed evidence atoms, and keeps adapter depth separate from classification. Descriptor and evidence mutation controls fail closed in 16/16 cases
[5]

The resulting matrix distinguishes adapter-observational evidence, adapter-policy evidence under controlled pressure, approximation substrates, rejected lowerings, and unknown rows

It studies the boundary of public TensorRT-LLM, SGLang/HiCache, Dynamo-style KV routing, and vLLM surfaces. The resulting matrix distinguishes adapter-observational evidence, adapter-policy evidence under controlled pressure, approximation substrates, rejected lowerings, and unknown rows
[6]

It demonstrates a local patched vLLM connector/scheduler-boundary mechanism at backend_patch depth. The repeated scheduler-boundary evaluation records 131/131 completed subprocesses, 131/131 valid event sequences, 30/30 successful observation passes, 30/30 same-claim scheduler-boundary failure-outcome passes, and fail-closed rejection of wrong-claim, uncl...
[7]

acceptance

ResidentClaim Obligations ResidentClaim lowering is strict because accepted responsibility is strict. The obligations are not arbitrary hurdles for existing systems; they define a conservative audit boundary for deciding whether a future-reuse claim was satisfied, demoted, expired, restored, refused, harmed, or simply never accepted. Each obligation blo...
[8]

Missing required obligations fail closed

Lowering Relation and Checker The core judgment is: backend + adapter + evidence |= ResidentClaim mode A backend/adapter/evidence tuple lowers a mode only if every required obligation for that mode is represented by the native backend or by an adapter whose depth and preconditions allow it to supply that obligation. Missing required obligations fail close...
[9]

The current depth ladder is: Table 4: Adapter depths and trust boundaries

Adapter Depths and Trust Boundaries The adapter boundary is part of the result because an adapter is part of the trusted computing base for any adapter-scoped row. The current depth ladder is: Table 4: Adapter depths and trust boundaries. Adapter depth Meaning in this study none Only native backend obligations count. telemetry_join External registry and ...
[10]

It summarizes the generated matrix and the boundary memos without treating feature names as conformance

External Runtime Boundary Studies The central result is a semantic lowering table, not a feature compatibility chart. It summarizes the generated matrix and the boundary memos without treating feature names as conformance. Table 6: Runtime boundary study summary. Substrate Best current evidence Fail-closed boundary Patched vLLM connector/scheduler bound...

2026
[11]

accepted

Patched vLLM Connector/Scheduler-Boundary Mechanism The local patched vLLM connector/scheduler-boundary mechanism supplies the missing offloadable lifecycle/outcome witness at backend_patch depth. It patches the vLLM pydev OffloadingConnector path rather than replacing the offload path with a standalone simulator. Native vLLM supplies real in-process co...
[12]

Evaluation The evaluation answers four questions. The first two are matrix questions: does the checker reject false positives, and do studied runtimes natively satisfy the obligations under current public evidence? The second two are mechanism questions: can the missing offloadable semantics be implemented in a real runtime path, and is the mechanism stab...
[13]

Each case is checked against the same obligation relation as the main matrix

False Positive Counterexamples The bad-lowering suite records feature-table inferences that a less strict study might accidentally call supported. Each case is checked against the same obligation relation as the main matrix. Table 9: False-positive counterexamples. Naive inference Checker result Why it fails priority_value_in_event -> soft_priority appr...
[14]

Limitation Consequence No native conformance is shown for public TensorRT-LLM, SGLang/HiCache, Dynamo, or upstream vLLM evidence

Limitations and Threats to Validity Table 10: Limitations and consequences. Limitation Consequence No native conformance is shown for public TensorRT-LLM, SGLang/HiCache, Dynamo, or upstream vLLM evidence. The positive claims are adapter-scoped or patch-scoped, not native backend support. The patched vLLM connector result is local backend_patch evidence...
[15]

Related Work and Prior-Art Boundary TensorRT-LLM is the closest primitive-level comparator. Its versioned KV-cache documentation describes cross-request reuse, prioritized LRU, retention priority/duration fields, and secondary-memory offload ( NVIDIA TensorRT-LLM KV Cache System, commit 06cff70502). This paper treats those mechanisms as serious substrates...

work page arXiv 2023
[16]

Artifact Availability and Reproducibility Notes The audit surface for this paper is the curated artifact repository resident-kv-lowering-artifact at commit b9f82f456e56e48454a9b4e0c608c2c783d0cbdb: https://github.com/gustavgauge/resident-kv-lowering-artifact.git The curated snapshot contains the checker, capability descriptors, generated matrix, bad-lowe...
[17]

Conclusion ResidentClaim lowering is an obligation problem, not a feature-name problem. TensorRT-LLM, SGLang/HiCache, Dynamo-style routing, and vLLM connector paths all expose useful KV mechanisms, but useful mechanisms do not automatically become accepted future-KV obligations. The fail-closed checker and boundary studies show how the artifact makes that...
[18]

Proceedings of the 29th Symposium on Operating Systems Principles

References • Woosuk Kwon et al. “Efficient Memory Management for Large Language Model Serving with PagedAttention.” SOSP 2023. DOI: https://doi.org/10.1145/3600006.3613165. • NVIDIA. “TensorRT-LLM KV Cache System.” Versioned documentation at commit 06cff70502, accessed 2026-05-23. https://github.com/NVIDIA/TensorRT-LLM/blob/06cff70502/docs/source/feature...

work page doi:10.1145/3600006.3613165 2023
[19]

Pie: A Programmable Serving System for Emerging LLM Applications,

https://doi.org/10.48550/arXiv.2510.24051. • Ruoyu Qin et al. “Mooncake: A KVCache-centric Disaggregated Architecture for LLM Serving.” ACM Transactions on Storage, 2025. https://doi.org/10.1145/3773772. • Yuhan Liu et al. “LMCache: An Efficient KV Cache Layer for Enterprise-Scale LLM Inference.” arXiv:2510.09665, 2025. https://doi.org/10.48550/arXiv.2...

work page doi:10.48550/arxiv.2510.24051 2025
[20]

Flowkv: A disaggregated inference framework with low-latency kv cache transfer and load-aware scheduling. arXiv preprint arXiv:2504.03775, 2025

https://doi.org/10.48550/arXiv.2504.03775. • Shi Qiu et al. “Tutti: Making SSD-Backed KV Cache Practical for Long-Context LLM Serving.” arXiv:2605.03375, 2026. https://doi.org/10.48550/arXiv.2605.03375. • NVIDIA. “FlexKV.” Dynamo documentation, accessed 2026-05-23. https://docs.nvidia.com/dynamo/integrations/flex-kv.

work page doi:10.48550/arxiv.2504.03775 2026
