Prompt cache: Modular attention reuse for low-latency inference
6 Pith papers cite this work. Polarity classification is still indexing.
citing papers explorer
- Continuous Semantic Caching for Low-Cost LLM Serving
Establishes the first rigorous framework for continuous semantic caching of LLM responses using ε-net discretization and kernel ridge regression, with sublinear regret bounds.
- PrefixWall: Mitigating Prefix Caching Side Channels in Shared LLM Systems
PrefixWall mitigates APC side channels in multi-tenant LLM systems via selective prefix isolation, delivering up to 70% higher cache reuse and 30% lower latency than full-isolation baselines.
- Efficient Remote KV Cache Reuse with GPU-native Video Codec
KVCodec uses GPU-native video codecs and pipelined fetching to compress and transmit KV caches, delivering up to 3.51x faster TTFT than prior methods while preserving accuracy.
- SGLang: Efficient Execution of Structured Language Model Programs
SGLang is a new system that speeds up structured LLM programs by up to 6.4x using RadixAttention for KV cache reuse and compressed finite state machines for output decoding.
- HieraSparse: Hierarchical Semi-Structured Sparse KV Attention
HieraSparse delivers a hierarchical semi-structured sparse KV attention system that achieves 1.2x KV compression and 4.57x decode attention speedup versus prior unstructured sparsity methods at equivalent sparsity, plus up to 1.85x prefill speedup and 1.37x/1.77x speedups with magnitude pruning and
- Rethinking LLMOps for Fraud and AML: Building a Compliance-Grade LLM Serving Stack
Workload-aware optimizations for LLM serving in AML and fraud detection yield substantial gains in throughput, latency, and GPU utilization on synthetic compliance prompts.