Prompt cache: Modular attention reuse for low-latency inference

In Gim, Guojun Chen, Seung-seob Lee, Nikhil Sarda, Anurag Khandelwal, Lin Zhong · 2024 · arXiv 2311.04934

6 Pith papers cite this work. Polarity classification is still indexing.

citation-role summary

background: 2 · baseline: 1

citation-polarity summary

background: 2 · baseline: 1

representative citing papers

Continuous Semantic Caching for Low-Cost LLM Serving

cs.LG · 2026-04-21 · unverdicted · novelty 7.0

Establishes the first rigorous framework for continuous semantic caching of LLM responses using ε-net discretization and kernel ridge regression, with sublinear regret bounds.

PrefixWall: Mitigating Prefix Caching Side Channels in Shared LLM Systems

cs.CR · 2026-03-11 · unverdicted · novelty 7.0

PrefixWall mitigates APC side channels in multi-tenant LLM systems via selective prefix isolation, delivering up to 70% higher cache reuse and 30% lower latency than full-isolation baselines.

Efficient Remote KV Cache Reuse with GPU-native Video Codec

cs.DC · 2026-02-10 · conditional · novelty 7.0

KVCodec uses GPU-native video codecs and pipelined fetching to compress and transmit KV caches, delivering up to 3.51x faster TTFT than prior methods while preserving accuracy.

SGLang: Efficient Execution of Structured Language Model Programs

cs.AI · 2023-12-12 · conditional · novelty 6.0

SGLang is a new system that speeds up structured LLM programs by up to 6.4x using RadixAttention for KV cache reuse and compressed finite state machines for output decoding.

HieraSparse: Hierarchical Semi-Structured Sparse KV Attention

cs.DC · 2026-04-18 · unverdicted · novelty 5.0

HieraSparse delivers a hierarchical semi-structured sparse KV attention system that achieves 1.2x KV compression and 4.57x decode attention speedup versus prior unstructured sparsity methods at equivalent sparsity, plus up to 1.85x prefill speedup and 1.37x/1.77x speedups with magnitude pruning and

Rethinking LLMOps for Fraud and AML: Building a Compliance-Grade LLM Serving Stack

cs.AI · 2026-05-11 · unverdicted · novelty 4.0

Workload-aware optimizations for LLM serving in AML and fraud detection yield substantial gains in throughput, latency, and GPU utilization on synthetic compliance prompts.

citing papers explorer

Continuous Semantic Caching for Low-Cost LLM Serving · cs.LG · 2026-04-21 · unverdicted · none · ref 10
PrefixWall: Mitigating Prefix Caching Side Channels in Shared LLM Systems · cs.CR · 2026-03-11 · unverdicted · none · ref 27
Efficient Remote KV Cache Reuse with GPU-native Video Codec · cs.DC · 2026-02-10 · conditional · none · ref 30
SGLang: Efficient Execution of Structured Language Model Programs · cs.AI · 2023-12-12 · conditional · none · ref 12
HieraSparse: Hierarchical Semi-Structured Sparse KV Attention · cs.DC · 2026-04-18 · unverdicted · none · ref 52
Rethinking LLMOps for Fraud and AML: Building a Compliance-Grade LLM Serving Stack · cs.AI · 2026-05-11 · unverdicted · none · ref 3
