Resident KV Claims: A Conformance Contract for Future Reuse under Active KV Pressure

Lukas Stepanek

arxiv: 2605.24259 · v1 · pith:DWYANCR2new · submitted 2026-05-22 · 💻 cs.DC


Pith reviewed 2026-06-30 14:15 UTC · model grok-4.3

classification 💻 cs.DC

keywords: resident KV claims · KV cache reuse · conformance contract · vLLM allocator · active refusal · scheduler arbitration · materialization predicate · write no-admit


The pith

Resident KV claims bind future-reuse intent to a materialization predicate and convert unreported resident loss into scheduler-visible active refusal with direct attribution.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces resident KV claims as a conformance contract for KV-cache systems that must handle cases where resident reusable blocks and active live requests compete for the same limited pool. Current reuse mechanisms supply hints and modes but leave undefined what happens to accepted future-reuse state when both cannot coexist, allowing silent eviction. The contract attaches each claim to a materialization predicate, lifecycle state, active/resident feasibility outcome, and claim-level telemetry. A minimal vLLM prototype demonstrates that hard protected claims change the outcome from write no-admit plus resident eviction to explicit active refusal reported to the scheduler. A companion litmus suite then reconstructs which of several distinct outcomes actually occurred.

Core claim

Resident KV claims are a conformance contract that binds future-reuse intent to a materialization predicate, lifecycle state, active/resident feasibility outcome, and claim-level telemetry. In controlled vLLM allocator probes where a 60-block resident claim and a 70-block active prefill exceed an 80-block usable KV pool, hard protected claims turn the prior failure mode, in which active allocation evicts residents, into scheduler-visible active refusal with direct blocking-claim attribution. The result supplies a runtime contract that makes ordinary eviction, soft priority, write no-admit, accepted hard claims, materialization failure, demotion, expiry, and active refusal distinguishable and reconstructable.
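The arithmetic behind this scenario can be written out against the feasibility boundary quoted in the Figure 2 excerpt (protected_resident_kv + active_live_kv <= usable_kv). The symbols R, A, and U are shorthand introduced here for illustration, not the paper's notation, and the block assumes the `amsmath` package.

```latex
\begin{align*}
R &= 60 \ (\text{protected resident KV}), \quad
A = 70 \ (\text{active live KV}), \quad
U = 80 \ (\text{usable KV}) \\
R + A &= 130 > 80 = U
  \quad\Rightarrow\quad \text{resident and active KV cannot coexist} \\
U - R &= 20 < 70 = A
  \quad\Rightarrow\quad \text{a hard claim leaves too few blocks for the active prefill}
\end{align*}
```

Under these numbers, coexistence becomes feasible only once U reaches 130 blocks, which agrees with the boundary named in the Figure 4 caption.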

What carries the argument

The resident KV claim, a conformance contract attaching future-reuse intent to materialization predicate, lifecycle state, feasibility outcome, and telemetry so that active/resident conflicts produce explicit, attributable scheduler outcomes.
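One way to picture the contract object is as a small record that carries the four elements named above. The following is a minimal sketch under stated assumptions: the class, field, and state names are hypothetical and do not come from the paper or from vLLM, and the lifecycle is reduced to a handful of illustrative states.

```python
from dataclasses import dataclass, field
from enum import Enum
from typing import Callable, Dict, List


class ClaimState(Enum):
    # Illustrative state names only; this summary does not list the paper's exact states.
    ACCEPTED = "accepted"
    MATERIALIZED = "materialized"
    MATERIALIZATION_FAILED = "materialization_failed"
    DEMOTED = "demoted"
    EXPIRED = "expired"


@dataclass
class ResidentClaim:
    claim_id: str
    blocks: int   # size of the claimed resident prefix, in KV blocks
    hard: bool    # hard claims exclude their blocks from eviction
    # Materialization predicate: True when enough claimed blocks are resident.
    is_materialized: Callable[[int], bool]
    state: ClaimState = ClaimState.ACCEPTED
    telemetry: List[Dict[str, object]] = field(default_factory=list)

    def observe(self, resident_blocks: int) -> ClaimState:
        """Re-evaluate the predicate and log one claim-level telemetry event."""
        if self.state in (ClaimState.DEMOTED, ClaimState.EXPIRED):
            return self.state  # treated as terminal in this sketch
        new_state = (
            ClaimState.MATERIALIZED
            if self.is_materialized(resident_blocks)
            else ClaimState.MATERIALIZATION_FAILED
        )
        self.telemetry.append(
            {
                "claim_id": self.claim_id,
                "from": self.state.value,
                "to": new_state.value,
                "resident_blocks": resident_blocks,
            }
        )
        self.state = new_state
        return new_state


# Hypothetical 60-block hard claim: materialized only if all 60 blocks are resident.
claim = ResidentClaim("claim-0", blocks=60, hard=True,
                      is_materialized=lambda resident: resident >= 60)
claim.observe(60)  # all blocks resident
claim.observe(45)  # 15 blocks lost: the predicate no longer holds
```

Keeping the predicate as a callable separates "the claim was accepted" from "the claimed state actually exists", which is the distinction the contract relies on.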

If this is right

Write no-admit still permits active allocation to evict residents from the shared pool.
Hard protected resident claims produce scheduler-visible active refusal instead of silent loss.
Claim-level telemetry and the litmus suite allow reconstruction of which distinct outcome occurred.
The contract distinguishes ordinary eviction, soft priority, write no-admit, materialization failure, demotion, expiry, and active refusal.
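The reconstruction step implied by these points can be sketched as a classifier over a JSONL trace. This is an illustration only: the event names (`claim_accepted`, `write_no_admit`, `resident_evicted`, `active_refused`, and so on), the field names, and the precedence order are assumptions made for this example, not the schema emitted by the paper's prototype.

```python
import json
from typing import Optional, Tuple


def reconstruct_outcome(jsonl_trace: str) -> Tuple[str, Optional[str]]:
    """Return (outcome label, blocking claim id or None) for one trace."""
    events = [json.loads(line) for line in jsonl_trace.splitlines() if line.strip()]
    kinds = {e["event"] for e in events}
    hard_claim = any(e["event"] == "claim_accepted" and e.get("hard") for e in events)

    # Active refusal takes precedence and carries the blocking-claim attribution.
    for e in events:
        if e["event"] == "active_refused":
            return "active_refusal", e.get("blocking_claim")

    # Claim-lifecycle outcomes that explain resident loss without an eviction event.
    for kind, label in (
        ("materialization_failed", "materialization_failure"),
        ("claim_demoted", "demotion"),
        ("claim_expired", "expiry"),
    ):
        if kind in kinds:
            return label, None

    if "resident_evicted" in kinds:
        if "write_no_admit" in kinds:
            return "write_no_admit_with_resident_eviction", None
        if "claim_accepted" in kinds and not hard_claim:
            return "soft_priority_eviction", None
        return "ordinary_eviction", None

    return ("accepted_hard_claim", None) if hard_claim else ("no_conflict", None)


# Two hypothetical traces for the 60/70/80-block scenario.
no_admit_trace = "\n".join([
    json.dumps({"event": "write_no_admit", "request_id": "req-1"}),
    json.dumps({"event": "resident_evicted", "blocks": 50}),
])
hard_trace = "\n".join([
    json.dumps({"event": "claim_accepted", "claim_id": "claim-0", "hard": True}),
    json.dumps({"event": "active_refused", "request_id": "req-1",
                "blocking_claim": "claim-0"}),
])
print(reconstruct_outcome(no_admit_trace))
print(reconstruct_outcome(hard_trace))
```

The point of the sketch is that each outcome maps to a distinct event signature, so a trace alone is enough to tell the cases apart.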

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

Schedulers could use the new refusal signals to prioritize or queue requests differently when resident claims are present.
The same contract structure could be applied to other shared memory pools that mix reusable and live allocations.
Portable implementation would require allocator interfaces to expose the materialization predicate and telemetry without allocator-specific extensions.

Load-bearing premise

The materialization predicate, lifecycle state, and claim-level telemetry can be implemented portably across allocators without introducing new failure modes or significant overhead.

What would settle it

Execute the vLLM litmus suite on an 80-block KV pool with a 60-block hard resident claim and 70-block active prefill; the claim is falsified if the active request is admitted and evicts the resident without producing a scheduler-visible refusal or blocking-claim attribution.
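The falsification condition can be made concrete with a toy probe. The sketch below is not the vLLM litmus suite and does not call any vLLM API: it models the pool as plain block counts, and the policy names, `ProbeResult` fields, and `claim-0` identifier are hypothetical.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class ProbeResult:
    admitted: bool                  # was the active request allocated?
    resident_blocks_left: int       # resident reusable blocks after the probe
    refusal_reported: bool          # did the scheduler see an explicit refusal?
    blocking_claim: Optional[str]   # attribution for the refusal, if any


def run_probe(policy: str, usable: int = 80, resident: int = 60,
              active: int = 70) -> ProbeResult:
    """Toy model of one allocator probe; policy is 'native', 'write_no_admit', or 'hard_claim'."""
    free = usable - resident
    if active <= free:
        # Feasible: both fit, so no arbitration is needed.
        return ProbeResult(True, resident, False, None)
    if policy == "hard_claim":
        # Protected residents are excluded from the victim set, so the
        # active request is refused and the blocking claim is named.
        return ProbeResult(False, resident, True, "claim-0")
    if active > usable:
        # The request cannot fit even in an empty pool; no claim is to blame.
        return ProbeResult(False, resident, True, None)
    # 'native' and 'write_no_admit' both evict residents to make room;
    # write no-admit only changes whether the active KV is cached afterwards.
    evicted = active - free
    return ProbeResult(True, resident - evicted, False, None)


def falsifies_claim(result: ProbeResult, resident: int = 60) -> bool:
    """True if residents were lost with no scheduler-visible refusal."""
    return (result.admitted
            and result.resident_blocks_left < resident
            and not result.refusal_reported)


hard = run_probe("hard_claim")
no_admit = run_probe("write_no_admit")
print(falsifies_claim(hard), falsifies_claim(no_admit))
```

In this toy model the hard-claim probe refuses the active request and names the blocking claim, while the write no-admit probe shows the silent loss the contract is meant to rule out for hard claims.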

Figures

Figures reproduced from arXiv: 2605.24259 by Lukas Stepanek.

**Figure 1.** Resident Claim Thesis. Adjacent text from the paper: 2. Active live KV: KV required to execute an in-flight request. Under full attention, active prefill chunks accumulate unless earlier chunks are freed, offloaded, or recomputed. 3. Future reusable admission: The decision to insert newly produced active KV into a reusable prefix cache after or during service. The distinction matters because a write no-admit policy acts on the third res…

**Figure 2.** Resident claim lifecycle. Adjacent text from the paper (Section 3.4, Feasibility Boundary): The boundary is: protected_resident_kv + active_live_kv <= usable_kv. If the inequality holds, the runtime may serve active work and preserve residents in the same pool, subject to ordinary eviction and scheduling details. If the inequality does not hold, a future-reuse hint cannot be treated as an unconditional command. The runtime must choose an explicit a…

**Figure 3.** Active resident KV arbitration. Adjacent text from the paper (Section 4.3, vLLM Prototype Evidence): The vLLM prototype is a patch-level contract hook against vLLM base commit b1388b1, not an upstream API. The published patch is intentionally narrow and prototype-grade. It adds env-gated JSONL telemetry, resident claim metadata, write no-admit for selected request ids, protected-resident victim exclusion, claim relaxation/expiry/harm events, sche…

**Figure 4.** Capacity sweep for a 60-block resident claim and 70-block active request. Native and write-noadmit policies serve active work below the 130-block feasibility boundary by losing resident reusable KV. Hard resident exclusion preserves the accepted resident claim and converts infeasible active/resident coexistence into scheduler-visible refusal. At and above 130 usable blocks, active and resident KV can coex…
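The capacity sweep described in the Figure 4 caption can be approximated from the feasibility inequality alone. The sketch below is a plain-arithmetic reading of the caption, not a reproduction of the paper's measurements; the function name, outcome labels, and the 10-block step are choices made for this example.

```python
RESIDENT_BLOCKS = 60   # resident claim size from the caption
ACTIVE_BLOCKS = 70     # active request size from the caption


def sweep_row(usable: int) -> dict:
    """Predicted outcome per policy family at one pool size (active fits alone here)."""
    feasible = RESIDENT_BLOCKS + ACTIVE_BLOCKS <= usable
    if feasible:
        evicting = hard = "active served, resident preserved"
    else:
        # Native and write no-admit make room by evicting residents.
        evicting = "active served, resident lost"
        # Hard resident exclusion keeps the claim and refuses the active request.
        hard = "active refused, resident preserved"
    return {"usable": usable, "feasible": feasible,
            "native_or_write_no_admit": evicting, "hard_claim": hard}


# Sweep only sizes where the 70-block active request fits in an empty pool.
rows = [sweep_row(u) for u in range(80, 150, 10)]
boundary = next(r["usable"] for r in rows if r["feasible"])

for r in rows:
    print(f'{r["usable"]:>4}  {r["native_or_write_no_admit"]:<34}  {r["hard_claim"]}')
print("first feasible pool size:", boundary)
```

Below the boundary the two policy families differ only in which side loses, and the hard-claim column is the one where that loss is reported to the scheduler.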

Original abstract

KV-cache reuse mechanisms increasingly expose priority, duration, offload, routing hints, scheduler modes, and event streams. These mechanisms help preserve reusable prefixes, but they do not by themselves define a portable contract for accepted future-reuse state when resident KV and active live KV cannot both fit. We introduce resident KV claims, a conformance contract that binds future-reuse intent to a materialization predicate, lifecycle state, active/resident feasibility outcome, and claim-level telemetry. In controlled vLLM allocator probes, a 60-block resident claim and a 70-block active prefill exceed an 80-block usable KV pool. Write no-admit prevents the active request from becoming future reusable state, but it still allows active allocation to evict residents from the shared pool. A minimal vLLM prototype shows that hard protected resident claims convert this failure mode into scheduler-visible active refusal with direct blocking-claim attribution. The result is not a production speedup or a new cache-replacement algorithm. It is a runtime contract that turns unreported resident loss into reconstructable active/resident arbitration. A companion MicroRuntime and vLLM litmus suite distinguish ordinary eviction, soft priority, write no-admit, accepted hard claims, materialization failure, demotion, expiry, active refusal, and trace-level outcome reconstruction.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note (private letter to a colleague)

The paper defines a conformance contract for resident KV claims to expose eviction conflicts in LLM KV caches, backed by a minimal vLLM prototype but no quantitative results yet.


The core contribution is a new contract called resident KV claims. It ties future-reuse intent to a materialization predicate, lifecycle state, feasibility outcome, and claim-level telemetry. This turns the case where a 60-block resident claim and 70-block active prefill exceed an 80-block pool from silent resident loss into a scheduler-visible refusal with direct attribution. The abstract and prototype description make clear this is not positioned as a replacement algorithm or speedup, just a portable way to make the arbitration reconstructable.

What works is the identification of the specific failure mode under write-no-admit and the distinction from ordinary eviction, soft priority, and other states via the companion litmus suite. The contract elements are stated cleanly and the vLLM probe shows the intended outcome without internal contradiction.

The main limitation is the absence of quantitative evaluation: there are no overhead measurements, no comparisons across allocators, and no data on how often the conflict arises in practice. The portability of the materialization predicate and telemetry across different KV allocators is asserted but not tested beyond the single minimal prototype. That keeps the result at the level of a useful definition rather than a validated mechanism.

This is aimed at systems builders working on LLM inference schedulers and memory management, especially those extending vLLM or similar runtimes. Someone already dealing with KV reuse hints and eviction policies would find the contract framing and the enumerated outcome distinctions directly applicable.

The work shows clear thinking about the runtime contract gap and honest scoping. It is worth sending to peer review so the prototype can be examined in full and the implementation details checked.

Referee Report

1 major / 2 minor

Summary. The paper proposes resident KV claims as a conformance contract for future KV-cache reuse under active pressure in systems such as vLLM. The contract includes a materialization predicate, lifecycle state, active/resident feasibility outcome, and claim-level telemetry. Using controlled probes in a minimal vLLM prototype, it demonstrates that hard protected resident claims transform a write-no-admit eviction scenario involving a 60-block resident claim and 70-block active prefill in an 80-block pool into a scheduler-visible active refusal with blocking-claim attribution. The work also introduces a MicroRuntime and vLLM litmus suite to distinguish between various outcomes including ordinary eviction, soft priority, write no-admit, accepted hard claims, materialization failure, demotion, expiry, active refusal, and trace-level outcome reconstruction. The contribution is explicitly scoped as a runtime contract rather than a production algorithm.

Significance. If the central claim holds, the resident KV claims contract would provide a structured, portable mechanism to handle conflicts between resident and active KV under memory pressure, converting unreported resident loss into explicit, reconstructable arbitration. This addresses limitations in existing KV-cache reuse mechanisms that provide various hints but lack a defined conformance contract for accepted future-reuse state. The prototype offers concrete evidence in one allocator, and the litmus suite supports verification of different scenarios. A strength is the clear scoping of the result and the focus on making failure modes distinguishable.

major comments (1)

[Abstract] The abstract claims that the minimal vLLM prototype shows conversion of the failure mode to scheduler-visible refusal, but supplies no quantitative data, error bars, or details of the experimental setup and number of probes, which undermines the ability to assess the reliability and generality of this outcome for the central claim.

minor comments (2)

[Abstract] The example numbers (60-block, 70-block, 80-block) are given without specifying the units or the exact configuration of the usable KV pool, which could be clarified for readers.
The manuscript would benefit from a dedicated section describing the implementation of the hard protected claims in the prototype to allow replication.

Simulated Author's Rebuttal

1 response · 0 unresolved

We thank the referee for the constructive review and the recommendation of major revision. The single major comment is addressed below. We agree that the abstract requires additional detail on the experimental probes to strengthen the presentation of the conformance contract demonstration.


Referee: [Abstract] The abstract claims that the minimal vLLM prototype shows conversion of the failure mode to scheduler-visible refusal, but supplies no quantitative data, error bars, or details of the experimental setup and number of probes, which undermines the ability to assess the reliability and generality of this outcome for the central claim.

Authors: We agree the abstract would benefit from explicit details on the probe setup. The reported outcome uses deterministic, controlled allocator probes in the minimal vLLM prototype (with the MicroRuntime) to exercise the resident KV claim contract under the specific 60-block resident / 70-block active / 80-block pool configuration. Because the tests are designed to produce a single, reproducible outcome for each litmus case rather than statistical measurements, error bars do not apply. We will revise the abstract to state the number of probes executed for the reported scenario and to briefly characterize the MicroRuntime environment. The litmus suite itself provides the mechanism for assessing generality across the enumerated outcome categories (ordinary eviction, write-no-admit, active refusal, etc.). This change will be made in the revised manuscript.

Revision: yes

Circularity Check

0 steps flagged

No significant circularity


The paper proposes a conformance contract definition (resident KV claims) with associated states and telemetry, then reports behavior observed in a minimal vLLM prototype. No equations, fitted parameters, or derivation steps appear in the provided text. The central claim is scoped to the contract's ability to make certain failure modes scheduler-visible; this is presented as a definitional and observational result rather than a reduction of any output to prior fitted inputs or self-citation chains. The work is self-contained against external benchmarks with no load-bearing self-references or ansatz smuggling.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 1 invented entity

The central claim rests on the domain assumption that KV-cache state can be partitioned into resident and active components with enforceable feasibility predicates; introduces one new entity (the claim) without independent evidence outside the proposal.

axioms (1)

Domain assumption: KV-cache reuse mechanisms expose priority, duration, offload, routing hints, scheduler modes, and event streams that can be extended with a conformance contract.
Stated in the opening of the abstract as the starting point for the new contract.

invented entities (1)

Resident KV claim (no independent evidence)
Purpose: binds future-reuse intent to materialization predicate, lifecycle state, active/resident feasibility outcome, and claim-level telemetry.
New postulated contract object introduced to address unreported resident loss.

pith-pipeline@v0.9.1-grok · 5758 in / 1191 out tokens · 36712 ms · 2026-06-30T14:15:18.975683+00:00 · methodology


Reference graph

Works this paper leans on

15 extracted references · 4 canonical work pages · 2 internal anchors

[1] Woosuk Kwon et al., “Efficient Memory Management for Large Language Model Serving with PagedAttention,” arXiv, https://arxiv.org/abs/2309.06180

[2] vLLM prefix caching documentation, https://docs.vllm.ai/en/v0.17.0/design/prefix_caching/

[3] vLLM issue “[RFC]: Context-Aware KV-Cache Retention API (Prioritized Evictions),” https://github.com/vllm-project/vllm/issues/37003

[4] NVIDIA Developer Blog, “Introducing New KV Cache Reuse Optimizations in NVIDIA TensorRT-LLM,” https://developer.nvidia.com/blog/introducing-new-kv-cache-reuse-optimizations-in-nvidia-tensorrt-llm/

[5] TensorRT-LLM KV cache documentation, https://nvidia.github.io/TensorRT-LLM/features/kvcache.html

[6] TensorRT-LLM useful runtime flags documentation, https://nvidia.github.io/TensorRT-LLM/performance/performance-tuning-guide/useful-runtime-flags.html

[7] SGLang HiCache design documentation, https://docs.sglang.io/docs/advanced_features/hicache_design

[8] SGLang server arguments documentation, https://docs.sglang.io/docs/advanced_features/server_arguments

[9] NVIDIA Dynamo agentic workflow documentation, https://docs.nvidia.com/dynamo/dev/user-guides/agents

[10] NVIDIA Dynamo SGLang agentic workload documentation, https://docs.nvidia.com/dynamo/dev/backends/sg-lang/agentic-workloads

[11] H. Li et al., “Continuum: Efficient and Robust Multi-Turn LLM Agent Scheduling with KV Cache Time-to-Live,” arXiv, https://arxiv.org/abs/2511.02230

[12] Z. Pan et al., “KVFlow: Efficient Prefix Caching for Accelerating LLM-Based Multi-Agent Workflows,” arXiv, https://arxiv.org/abs/2507.07400

[13] In Gim et al., “Pie: A Programmable Serving System for Emerging LLM Applications,” SOSP 2025 / arXiv, https://arxiv.org/abs/2510.24051

[14] Rui Pan et al., “Marconi: Prefix Caching for the Era of Hybrid LLMs,” Proceedings of Machine Learning and Systems 7 (MLSys 2025), https://proceedings.mlsys.org/paper_files/paper/2025/hash/7c180af017258d239bac6248d1eb26ac-Abstract-Conference.html

[15] vLLM project blog, “vLLM x Mooncake: KV Cache-Centric Disaggregated Architecture for LLM Serving,” https://vllm.ai/blog/2026-05-06-mooncake-store

Appendix A: Reproducibility Inventory. This inventory records the public, commit-pinned artifacts used by the manuscript. Claims in the paper are tied to the public repositories, generated artifact files, and...
