arXiv preprint arXiv:2503.16525 (2025)

Huan Yang, Renji Zhang, Mingzhe Huang, Weijun Wang, Yin Tang, Yuanchun Li, Yunxin Liu, Deyu Zhang · 2025 · arXiv 2503.16525

4 Pith papers cite this work. Polarity classification is still indexing.


representative citing papers

Leyline: KV Cache Directives for Agentic Inference

cs.DC · 2026-05-31 · unverdicted · novelty 7.0

Leyline adds a policy-directed KV cache edit primitive with closed-form RoPE correction for agentic inference, reporting +11.2 pp cache-hit lift and +14.3 pp solve-rate gain.

Dual-Pool Token-Budget Routing for Cost-Efficient and Reliable LLM Serving

cs.CL · 2026-04-09 · unverdicted · novelty 6.0

Dual-pool token-budget routing for LLM serving reduces GPU-hours by 31-42% and preemption rates by 5.4x through online-learned request classification without a tokenizer.

OxyGen: Unified KV Cache Management for VLA Inference under Multi-Task Parallelism

cs.RO · 2026-03-15 · unverdicted · novelty 6.0

OxyGen unifies KV cache management in MoT VLAs to enable cross-task KV sharing and cross-frame continuous batching, delivering up to 3.7x speedup with 200+ tokens/s for language and 70 Hz for action on on-device platforms.

SparseX: Efficient Segment-Level KV Cache Sharing for Interleaved LLM Serving

cs.PF · 2026-06-01 · unverdicted · novelty 5.0

SparseX adds segment-level KV cache reuse with Sparse-Q guided recomputation and layer-wise hybrid attention to handle interleaved serving patterns beyond standard prefix caching.

citing papers explorer


Leyline: KV Cache Directives for Agentic Inference · cs.DC · 2026-05-31 · unverdicted · none · ref 43
Dual-Pool Token-Budget Routing for Cost-Efficient and Reliable LLM Serving · cs.CL · 2026-04-09 · unverdicted · none · ref 19
OxyGen: Unified KV Cache Management for VLA Inference under Multi-Task Parallelism · cs.RO · 2026-03-15 · unverdicted · none · ref 45
SparseX: Efficient Segment-Level KV Cache Sharing for Interleaved LLM Serving · cs.PF · 2026-06-01 · unverdicted · none · ref 14
