
arxiv: 2605.24259 · v1 · pith:DWYANCR2new · submitted 2026-05-22 · 💻 cs.DC

Resident KV Claims: A Conformance Contract for Future Reuse under Active KV Pressure

Pith reviewed 2026-06-30 14:15 UTC · model grok-4.3

classification 💻 cs.DC
keywords resident KV claims · KV cache reuse · conformance contract · vLLM allocator · active refusal · scheduler arbitration · materialization predicate · write no-admit

The pith

Resident KV claims bind future-reuse intent to a materialization predicate and convert unreported resident loss into scheduler-visible active refusal with direct attribution.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces resident KV claims as a conformance contract for KV-cache systems that must handle cases where resident reusable blocks and active live requests compete for the same limited pool. Current reuse mechanisms supply hints and modes but leave undefined what happens to accepted future-reuse state when both cannot coexist, allowing silent eviction. The contract attaches each claim to a materialization predicate, lifecycle state, active/resident feasibility outcome, and claim-level telemetry. A minimal vLLM prototype demonstrates that hard protected claims change the outcome from write no-admit plus resident eviction to explicit active refusal reported to the scheduler. A companion litmus suite then reconstructs which of several distinct outcomes actually occurred.

Core claim

Resident KV claims are a conformance contract that binds future-reuse intent to a materialization predicate, lifecycle state, active/resident feasibility outcome, and claim-level telemetry. In controlled vLLM allocator probes where a 60-block resident claim and a 70-block active prefill exceed an 80-block usable KV pool, hard protected claims turn the prior failure mode of active allocation evicting residents into scheduler-visible active refusal with direct blocking-claim attribution. The result supplies a runtime contract that makes ordinary eviction, soft priority, write no-admit, accepted hard claims, materialization failure, demotion, expiry, and active refusal distinguishable and reconstructable.

What carries the argument

The resident KV claim, a conformance contract attaching future-reuse intent to materialization predicate, lifecycle state, feasibility outcome, and telemetry so that active/resident conflicts produce explicit, attributable scheduler outcomes.
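
As a rough illustration of what such a contract object might carry, the Python sketch below models a claim with the four attachments named above. Every name in it (ResidentClaim, ClaimState, FeasibilityOutcome, the field names) is hypothetical and not taken from the paper or from vLLM. The lifecycle states are an assumed subset inferred from the outcomes this summary lists.

```python
from dataclasses import dataclass, field
from enum import Enum
from typing import Callable, List, Optional


class ClaimState(Enum):
    # Hypothetical lifecycle states, loosely following the outcomes named in the summary.
    REQUESTED = "requested"
    ACCEPTED = "accepted"
    MATERIALIZED = "materialized"
    MATERIALIZATION_FAILED = "materialization_failed"
    DEMOTED = "demoted"
    EXPIRED = "expired"


class FeasibilityOutcome(Enum):
    COEXIST = "coexist"                    # resident and active KV both fit
    ACTIVE_REFUSED = "active_refused"      # active request refused to protect the claim
    RESIDENT_EVICTED = "resident_evicted"  # resident KV lost to active allocation


@dataclass
class ResidentClaim:
    claim_id: str
    blocks: int
    hard: bool
    # Materialization predicate: returns True once the claimed KV is actually resident.
    is_materialized: Callable[[], bool]
    state: ClaimState = ClaimState.REQUESTED
    outcome: Optional[FeasibilityOutcome] = None
    telemetry: List[dict] = field(default_factory=list)

    def transition(self, new_state: ClaimState, reason: str) -> None:
        """Record every lifecycle change as a claim-level telemetry event."""
        self.telemetry.append(
            {"claim_id": self.claim_id, "from": self.state.value,
             "to": new_state.value, "reason": reason}
        )
        self.state = new_state

    def check_materialization(self) -> bool:
        """Evaluate the predicate and move to the matching lifecycle state."""
        ok = self.is_materialized()
        self.transition(
            ClaimState.MATERIALIZED if ok else ClaimState.MATERIALIZATION_FAILED,
            "predicate held" if ok else "predicate failed",
        )
        return ok


claim = ResidentClaim("c1", blocks=60, hard=True, is_materialized=lambda: True)
claim.transition(ClaimState.ACCEPTED, "hard claim accepted")
claim.check_materialization()
```

Keeping the predicate as a callable is one possible design choice: it lets the same claim object be checked against different allocators without the claim knowing how residency is tracked.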

If this is right

  • Write no-admit still permits active allocation to evict residents from the shared pool.
  • Hard protected resident claims produce scheduler-visible active refusal instead of silent loss.
  • Claim-level telemetry and the litmus suite allow reconstruction of which distinct outcome occurred.
  • The contract distinguishes ordinary eviction, soft priority, write no-admit, materialization failure, demotion, expiry, and active refusal.
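
A minimal sketch of how trace-level reconstruction could work, assuming each telemetry record carries a `kind` and a `claim_id`. The event kinds and record layout here are invented for illustration; the paper's actual telemetry schema is not shown in this summary.

```python
from typing import Iterable, List

# Hypothetical event kinds mapped to the outcome categories listed above.
OUTCOME_BY_EVENT = {
    "resident_evicted": "ordinary eviction",
    "soft_priority_applied": "soft priority",
    "write_no_admit": "write no-admit",
    "materialization_failed": "materialization failure",
    "claim_demoted": "demotion",
    "claim_expired": "expiry",
    "active_refused": "active refusal",
}


def reconstruct(events: Iterable[dict], claim_id: str) -> List[str]:
    """Return the distinct outcomes recorded for one claim, in trace order."""
    seen: List[str] = []
    for event in events:
        if event.get("claim_id") != claim_id:
            continue  # ignore events attributed to other claims
        outcome = OUTCOME_BY_EVENT.get(event.get("kind"))
        if outcome is not None and outcome not in seen:
            seen.append(outcome)
    return seen


trace = [
    {"kind": "claim_accepted", "claim_id": "c1"},
    {"kind": "active_refused", "claim_id": "c1", "request_id": "r7"},
    {"kind": "resident_evicted", "claim_id": "c2"},
]
print(reconstruct(trace, "c1"))  # ['active refusal']
```

The point of the sketch is that each outcome needs its own event kind with claim attribution; without that, an eviction and a refusal would be indistinguishable after the fact.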

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • Schedulers could use the new refusal signals to prioritize or queue requests differently when resident claims are present.
  • The same contract structure could be applied to other shared memory pools that mix reusable and live allocations.
  • Portable implementation would require allocator interfaces to expose the materialization predicate and telemetry without allocator-specific extensions.

Load-bearing premise

The materialization predicate, lifecycle state, and claim-level telemetry can be implemented portably across allocators without introducing new failure modes or significant overhead.

What would settle it

Execute the vLLM litmus suite on an 80-block KV pool with a 60-block hard resident claim and 70-block active prefill; the claim is falsified if the active request is admitted and evicts the resident without producing a scheduler-visible refusal or blocking-claim attribution.
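
The shape of that check can be sketched against a toy block pool. This is a stand-in written for illustration, not vLLM's allocator or the paper's litmus suite; `ToyPool`, `run_litmus`, and the event fields are all hypothetical.

```python
class ToyPool:
    """Toy KV block pool with optional hard protection for resident claims."""

    def __init__(self, usable_blocks: int):
        self.usable = usable_blocks
        self.resident = {}  # claim_id -> (blocks, hard)
        self.active = 0
        self.events = []

    def add_resident(self, claim_id: str, blocks: int, hard: bool) -> None:
        self.resident[claim_id] = (blocks, hard)

    def free_blocks(self) -> int:
        held = sum(blocks for blocks, _ in self.resident.values())
        return self.usable - self.active - held

    def allocate_active(self, request_id: str, blocks: int) -> bool:
        """Evict unprotected residents if that suffices; otherwise refuse."""
        shortfall = blocks - self.free_blocks()
        soft = [c for c, (_, hard) in self.resident.items() if not hard]
        evictable = sum(self.resident[c][0] for c in soft)
        if shortfall > evictable:
            # Hard claims are excluded from the victim set, so report them.
            blocking = sorted(c for c, (_, hard) in self.resident.items() if hard)
            self.events.append({"kind": "active_refused",
                                "request_id": request_id,
                                "blocking_claims": blocking})
            return False
        for claim_id in soft:
            if shortfall <= 0:
                break
            freed, _ = self.resident.pop(claim_id)
            shortfall -= freed
            self.events.append({"kind": "resident_evicted",
                                "claim_id": claim_id,
                                "request_id": request_id})
        self.active += blocks
        return True


def run_litmus(hard: bool):
    """80-block pool, 60-block resident claim, 70-block active prefill."""
    pool = ToyPool(usable_blocks=80)
    pool.add_resident("claim-60", blocks=60, hard=hard)
    admitted = pool.allocate_active("prefill-70", blocks=70)
    return admitted, pool.events


hard_admitted, hard_events = run_litmus(hard=True)
soft_admitted, soft_events = run_litmus(hard=False)
```

In this model the falsifying observation would be `hard_admitted` being true, or a refusal event that names no blocking claim.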

Figures

Figures reproduced from arXiv: 2605.24259 by Lukas Stepanek.

Figure 1: Resident Claim Thesis.
  2. Active live KV: KV required to execute an in-flight request. Under full attention, active prefill chunks accumulate unless earlier chunks are freed, offloaded, or recomputed.
  3. Future reusable admission: The decision to insert newly produced active KV into a reusable prefix cache after or during service. The distinction matters because a write no-admit policy acts on the third res…
Figure 2: Resident claim lifecycle.
3.4 Feasibility Boundary. The boundary is:
    protected_resident_kv + active_live_kv <= usable_kv
If the inequality holds, the runtime may serve active work and preserve residents in the same pool, subject to ordinary eviction and scheduling details. If the inequality does not hold, a future-reuse hint cannot be treated as an unconditional command. The runtime must choose an explicit a…
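
The boundary in the excerpt above reduces to one comparison. The sketch below evaluates it for the block counts used in the paper's probes; the function name and the returned deficit are illustrative additions, not part of the paper.

```python
from typing import Tuple


def feasibility(protected_resident_kv: int, active_live_kv: int,
                usable_kv: int) -> Tuple[bool, int]:
    """Check protected_resident_kv + active_live_kv <= usable_kv.

    Returns (feasible, deficit), where deficit is the number of blocks
    by which combined demand exceeds the usable pool (0 if feasible).
    """
    demand = protected_resident_kv + active_live_kv
    deficit = max(0, demand - usable_kv)
    return deficit == 0, deficit


print(feasibility(60, 70, 80))   # (False, 50)
print(feasibility(60, 70, 130))  # (True, 0)
```

With a 60-block claim and a 70-block prefill, the 80-block pool is 50 blocks short, which is why coexistence first becomes possible at 130 usable blocks.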
Figure 3: Active resident KV arbitration.
4.3 vLLM Prototype Evidence. The vLLM prototype is a patch-level contract hook against vLLM base commit b1388b1, not an upstream API. The published patch is intentionally narrow and prototype-grade. It adds env-gated JSONL telemetry, resident claim metadata, write no-admit for selected request ids, protected-resident victim exclusion, claim relaxation/expiry/harm events, sche…
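
For readers unfamiliar with the pattern, env-gated JSONL telemetry can be sketched as follows. The environment variable name, the event fields, and both helper functions are assumptions made for this example; they are not the names used by the published patch or by vLLM.

```python
import json
import os
import tempfile

# Hypothetical variable name; the prototype's actual gate is not given here.
TRACE_ENV = "RESIDENT_CLAIM_TRACE_PATH"


def emit(event: dict) -> bool:
    """Append one JSON object per line, only when the env gate is set."""
    path = os.environ.get(TRACE_ENV)
    if not path:
        return False  # telemetry disabled: nothing is written
    with open(path, "a", encoding="utf-8") as fh:
        fh.write(json.dumps(event, sort_keys=True) + "\n")
    return True


def read_trace(path: str) -> list:
    """Load a JSONL trace back into a list of event dicts."""
    with open(path, encoding="utf-8") as fh:
        return [json.loads(line) for line in fh if line.strip()]


with tempfile.TemporaryDirectory() as tmp:
    os.environ[TRACE_ENV] = os.path.join(tmp, "claims.jsonl")
    emit({"kind": "active_refused", "request_id": "prefill-70",
          "blocking_claims": ["claim-60"]})
    events = read_trace(os.environ[TRACE_ENV])
    del os.environ[TRACE_ENV]  # leave the gate closed afterwards
```

Gating on an environment variable keeps the default code path free of file I/O, which fits a patch described as intentionally narrow.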
Figure 4: Capacity sweep for a 60-block resident claim and 70-block active request. Native and write-no-admit policies serve active work below the 130-block feasibility boundary by losing resident reusable KV. Hard resident exclusion preserves the accepted resident claim and converts infeasible active/resident coexistence into scheduler-visible refusal. At and above 130 usable blocks, active and resident KV can coex…
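
The sweep described in this caption can be approximated with a small decision function. This is a simplified model of the three policies as the caption describes them, not the prototype's code; the policy labels and result fields are invented. It does not model the one way write no-admit differs from native behavior, namely whether the active request's KV later becomes reusable state.

```python
POLICIES = ("native", "write_no_admit", "hard_exclusion")


def sweep_outcome(policy: str, usable: int,
                  resident: int = 60, active: int = 70) -> dict:
    """Outcome of one capacity point for a resident claim plus an active request."""
    if policy not in POLICIES:
        raise ValueError(f"unknown policy: {policy}")
    if usable < active:
        raise ValueError("sweep assumes the active request fits an empty pool")
    if resident + active <= usable:
        # At or above the feasibility boundary both can coexist.
        return {"active_served": True, "resident_preserved": True, "refusal": False}
    if policy == "hard_exclusion":
        # Protected residents are not victims, so the active request is refused.
        return {"active_served": False, "resident_preserved": True, "refusal": True}
    # Native and write no-admit both serve active work by evicting residents.
    return {"active_served": True, "resident_preserved": False, "refusal": False}


def label(outcome: dict) -> str:
    if outcome["refusal"]:
        return "refuse active, keep resident"
    if outcome["resident_preserved"]:
        return "coexist"
    return "serve active, lose resident"


for usable in range(80, 141, 10):
    row = ", ".join(f"{p}: {label(sweep_outcome(p, usable))}" for p in POLICIES)
    print(f"{usable:>3} blocks | {row}")
```

Below 130 blocks the first two policies give identical rows in this model, matching the caption's statement that both lose resident reusable KV.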
Original abstract

KV-cache reuse mechanisms increasingly expose priority, duration, offload, routing hints, scheduler modes, and event streams. These mechanisms help preserve reusable prefixes, but they do not by themselves define a portable contract for accepted future-reuse state when resident KV and active live KV cannot both fit. We introduce resident KV claims, a conformance contract that binds future-reuse intent to a materialization predicate, lifecycle state, active/resident feasibility outcome, and claim-level telemetry. In controlled vLLM allocator probes, a 60-block resident claim and a 70-block active prefill exceed an 80-block usable KV pool. Write no-admit prevents the active request from becoming future reusable state, but it still allows active allocation to evict residents from the shared pool. A minimal vLLM prototype shows that hard protected resident claims convert this failure mode into scheduler-visible active refusal with direct blocking-claim attribution. The result is not a production speedup or a new cache-replacement algorithm. It is a runtime contract that turns unreported resident loss into reconstructable active/resident arbitration. A companion MicroRuntime and vLLM litmus suite distinguish ordinary eviction, soft priority, write no-admit, accepted hard claims, materialization failure, demotion, expiry, active refusal, and trace-level outcome reconstruction.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

1 major / 2 minor

Summary. The paper proposes resident KV claims as a conformance contract for future KV-cache reuse under active pressure in systems such as vLLM. The contract includes a materialization predicate, lifecycle state, active/resident feasibility outcome, and claim-level telemetry. Using controlled probes in a minimal vLLM prototype, it demonstrates that hard protected resident claims transform a write-no-admit eviction scenario involving a 60-block resident claim and 70-block active prefill in an 80-block pool into a scheduler-visible active refusal with blocking-claim attribution. The work also introduces a MicroRuntime and vLLM litmus suite to distinguish between various outcomes including ordinary eviction, soft priority, write no-admit, accepted hard claims, materialization failure, demotion, expiry, active refusal, and trace-level outcome reconstruction. The contribution is explicitly scoped as a runtime contract rather than a production algorithm.

Significance. If the central claim holds, the resident KV claims contract would provide a structured, portable mechanism to handle conflicts between resident and active KV under memory pressure, converting unreported resident loss into explicit, reconstructable arbitration. This addresses limitations in existing KV-cache reuse mechanisms that provide various hints but lack a defined conformance contract for accepted future-reuse state. The prototype offers concrete evidence in one allocator, and the litmus suite supports verification of different scenarios. A strength is the clear scoping of the result and the focus on making failure modes distinguishable.

major comments (1)
  1. [Abstract] The abstract claims that the minimal vLLM prototype converts the failure mode into scheduler-visible refusal, but it supplies no quantitative data, error bars, details of the experimental setup, or number of probes. This makes it hard to assess the reliability and generality of the outcome behind the central claim.
minor comments (2)
  1. [Abstract] The example numbers (60-block, 70-block, 80-block) are given without specifying the units or the exact configuration of the usable KV pool, which could be clarified for readers.
  2. The manuscript would benefit from a dedicated section describing the implementation of the hard protected claims in the prototype to allow replication.

Simulated Author's Rebuttal

1 response · 0 unresolved

We thank the referee for the constructive review and the recommendation of major revision. The single major comment is addressed below. We agree that the abstract requires additional detail on the experimental probes to strengthen the presentation of the conformance contract demonstration.

Point-by-point responses
  1. Referee: [Abstract] The abstract claims that the minimal vLLM prototype converts the failure mode into scheduler-visible refusal, but it supplies no quantitative data, error bars, details of the experimental setup, or number of probes. This makes it hard to assess the reliability and generality of the outcome behind the central claim.

    Authors: We agree the abstract would benefit from explicit details on the probe setup. The reported outcome uses deterministic, controlled allocator probes in the minimal vLLM prototype (with the MicroRuntime) to exercise the resident KV claim contract under the specific 60-block resident / 70-block active / 80-block pool configuration. Because the tests are designed to produce a single, reproducible outcome for each litmus case rather than statistical measurements, error bars do not apply. We will revise the abstract to state the number of probes executed for the reported scenario and to briefly characterize the MicroRuntime environment. The litmus suite itself provides the mechanism for assessing generality across the enumerated outcome categories (ordinary eviction, write-no-admit, active refusal, etc.). This change will be made in the revised manuscript.

    Revision: yes

Circularity Check

0 steps flagged

No significant circularity

full rationale

The paper proposes a conformance contract definition (resident KV claims) with associated states and telemetry, then reports behavior observed in a minimal vLLM prototype. No equations, fitted parameters, or derivation steps appear in the provided text. The central claim is scoped to the contract's ability to make certain failure modes scheduler-visible; this is presented as a definitional and observational result rather than a reduction of any output to prior fitted inputs or self-citation chains. The work is self-contained against external benchmarks with no load-bearing self-references or ansatz smuggling.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 1 invented entity

The central claim rests on the domain assumption that KV-cache state can be partitioned into resident and active components with enforceable feasibility predicates. It introduces one new entity (the claim) without independent evidence outside the proposal.

axioms (1)
  • domain assumption: KV-cache reuse mechanisms expose priority, duration, offload, routing hints, scheduler modes, and event streams that can be extended with a conformance contract
    Stated in the opening of the abstract as the starting point for the new contract
invented entities (1)
  • resident KV claim (no independent evidence)
    purpose: Binds future-reuse intent to materialization predicate, lifecycle state, active/resident feasibility outcome, and claim-level telemetry
    New postulated contract object introduced to address unreported resident loss

pith-pipeline@v0.9.1-grok · 5758 in / 1191 out tokens · 36712 ms · 2026-06-30T14:15:18.975683+00:00 · methodology


Reference graph

Works this paper leans on

15 extracted references · 4 canonical work pages · 2 internal anchors

  1. [1]

    Efficient Memory Management for Large Language Model Serving with PagedAttention

    Woosuk Kwon et al., “Efficient Memory Management for Large Language Model Serving with PagedAttention,” arXiv, https://arxiv.org/abs/2309.06180

  2. [2]

    vLLM prefix caching documentation, https://docs.vllm.ai/en/v0.17.0/design/prefix_caching/

  3. [3]

    [RFC]: Context-Aware KV-Cache Retention API (Prioritized Evictions)

    vLLM issue “[RFC]: Context-Aware KV-Cache Retention API (Prioritized Evictions),” https://github.com/vllm-project/vllm/issues/37003

  4. [4]

    Introducing New KV Cache Reuse Optimizations in NVIDIA TensorRT-LLM

    NVIDIA Developer Blog, “Introducing New KV Cache Reuse Optimizations in NVIDIA TensorRT-LLM,” https://developer.nvidia.com/blog/introducing-new-kv-cache-reuse-optimizations-in-nvidia-tensorrt-llm/

  5. [5]

    TensorRT-LLM KV cache documentation, https://nvidia.github.io/TensorRT-LLM/features/kvcache.html

  6. [6]

    TensorRT-LLM useful runtime flags documentation, https://nvidia.github.io/TensorRT-LLM/performance/performance-tuning-guide/useful-runtime-flags.html

  7. [7]

    SGLang HiCache design documentation, https://docs.sglang.io/docs/advanced_features/hicache_design

  8. [8]

    SGLang server arguments documentation, https://docs.sglang.io/docs/advanced_features/server_arguments

  9. [9]

    NVIDIA Dynamo agentic workflow documentation, https://docs.nvidia.com/dynamo/dev/user-guides/agents

  10. [10]

    NVIDIA Dynamo SGLang agentic workload documentation, https://docs.nvidia.com/dynamo/dev/backends/sg-lang/agentic-workloads

  11. [11]

    Continuum: Efficient and Robust Multi-Turn LLM Agent Scheduling with KV Cache Time-to-Live

    H. Li et al., “Continuum: Efficient and Robust Multi-Turn LLM Agent Scheduling with KV Cache Time-to-Live,” arXiv, https://arxiv.org/abs/2511.02230

  12. [12]

    KVFlow: Efficient Prefix Caching for Accelerating LLM-Based Multi-Agent Workflows

    Z. Pan et al., “KVFlow: Efficient Prefix Caching for Accelerating LLM-Based Multi-Agent Workflows,” arXiv, https://arxiv.org/abs/2507.07400

  13. [13]

    Pie: A Programmable Serving System for Emerging LLM Applications

    In Gim et al., “Pie: A Programmable Serving System for Emerging LLM Applications,” SOSP 2025 / arXiv, https://arxiv.org/abs/2510.24051

  14. [14]

    Marconi: Prefix Caching for the Era of Hybrid LLMs

    Rui Pan et al., “Marconi: Prefix Caching for the Era of Hybrid LLMs,” Proceedings of Machine Learning and Systems 7 (MLSys 2025), https://proceedings.mlsys.org/paper_files/paper/2025/hash/7c180af017258d239bac6248d1eb26ac-Abstract-Conference.html

  15. [15]

    vLLM x Mooncake: KV Cache-Centric Disaggregated Architecture for LLM Serving

    vLLM project blog, “vLLM x Mooncake: KV Cache-Centric Disaggregated Architecture for LLM Serving,” https://vllm.ai/blog/2026-05-06-mooncake-store

Appendix A: Reproducibility Inventory

This inventory records the public, commit-pinned artifacts used by the manuscript. Claims in the paper are tied to the public repositories, generated artifact files, and...