Rethinking Caching for LLM Serving Systems: Beyond Traditional Heuristics

Chanwoo Moon; Heejin Kim; Jaeheon Lee; Jungwoo Kim; Minsang Kim; Sungjin Lee; Taeho Hwang; Woosuk Chung; Yeseong Kim

arXiv: 2508.18736 · v1 · submitted 2025-08-26 · cs.DB · cs.LG





Keywords: caching, serving, SISO, traditional, caches, memory, semantic, state-of-the-art


Serving Large Language Models (LLMs) at scale requires meeting strict Service Level Objectives (SLOs) under severe computational and memory constraints. Nevertheless, traditional caching strategies fall short: exact-matching and prefix caches neglect query semantics, while state-of-the-art semantic caches remain confined to traditional intuitions, offering little conceptual departure. Building on this, we present SISO, a semantic caching system that redefines efficiency for LLM serving. SISO introduces centroid-based caching to maximize coverage with minimal memory, locality-aware replacement to preserve high-value entries, and dynamic thresholding to balance accuracy and latency under varying workloads. Across diverse datasets, SISO delivers up to 1.71$\times$ higher hit ratios and consistently stronger SLO attainment compared to state-of-the-art systems.
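The abstract names three mechanisms without giving details, so the following is only a minimal sketch of the general idea of a centroid-based semantic cache with an adjustable similarity threshold, not SISO's actual implementation. The class name `CentroidCache`, the running-mean merge rule, the cosine-similarity metric, and the toy 2-D embeddings are all assumptions made for illustration; locality-aware replacement is not modeled here.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length vectors (0.0 if either is zero)."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na and nb else 0.0


class CentroidCache:
    """Hypothetical semantic cache: one entry per cluster of similar queries."""

    def __init__(self, threshold):
        # Similarity cutoff; it can be changed at runtime to trade
        # hit ratio against answer accuracy (assumed knob).
        self.threshold = threshold
        self.entries = []  # each entry: [centroid, member_count, response]

    def _nearest(self, vec):
        best, best_sim = None, -1.0
        for entry in self.entries:
            sim = cosine(vec, entry[0])
            if sim > best_sim:
                best, best_sim = entry, sim
        return best, best_sim

    def lookup(self, vec):
        """Return the cached response of the nearest centroid, or None on a miss."""
        best, sim = self._nearest(vec)
        return best[2] if best is not None and sim >= self.threshold else None

    def insert(self, vec, response):
        """Merge into the nearest centroid if similar enough, else add a new entry."""
        best, sim = self._nearest(vec)
        if best is not None and sim >= self.threshold:
            n = best[1]
            # Running mean keeps a single representative vector per cluster.
            best[0] = [(c * n + v) / (n + 1) for c, v in zip(best[0], vec)]
            best[1] = n + 1
        else:
            self.entries.append([list(vec), 1, response])


cache = CentroidCache(threshold=0.9)
cache.insert([1.0, 0.0], "answer A")
cache.insert([0.0, 1.0], "answer B")
cache.insert([0.95, 0.1], "answer A (paraphrase)")  # merged into the first centroid
hit = cache.lookup([0.99, 0.05])
miss = cache.lookup([0.7, 0.7])
```

In this sketch, storing one centroid per cluster instead of every query is what keeps memory small, and lowering `threshold` turns borderline misses into hits at the cost of returning less closely matched answers.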
