pith. machine review for the scientific record.

arxiv: 2605.07443 · v1 · submitted 2026-05-08 · 💻 cs.DC

Recognition: no theorem link

RcLLM: Accelerating Generative Recommendation via Beyond-Prefix KV Caching

Authors on Pith: no claims yet

Pith reviewed 2026-05-11 02:03 UTC · model grok-4.3

classification 💻 cs.DC
keywords: generative recommendation · KV caching · LLM inference · distributed serving · time-to-first-token · prefix caching · selective attention

The pith

RcLLM decomposes recommendation prompts into reusable blocks to cache KV states beyond contiguous prefixes, cutting time-to-first-token by 1.31x-9.51x.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

Large language models turn recommendation into a generative task but face high latency from long personalized prompts that standard prefix caching cannot reuse efficiently. RcLLM introduces beyond-prefix KV caching that breaks prompts into reusable blocks, stores compact user histories in replicated caches for instant access, and shards large item caches with similarity-aware placement. An affinity scheduler improves locality while a selective attention step corrects approximation errors from non-contiguous reuse. On real-world datasets this yields large TTFT reductions while keeping recommendation accuracy essentially unchanged, opening the door to real-time generative serving at scale.
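To make the mechanism concrete, here is a minimal Python sketch of beyond-prefix block reuse. The block boundaries, hash-based cache keys, and two-tier split are illustrative assumptions; the abstract does not specify RcLLM's actual decomposition or key scheme.

```python
# Hypothetical sketch of beyond-prefix block reuse, not RcLLM's implementation.
import hashlib

def block_key(text: str) -> str:
    """Content hash used as the KV-cache key for one reusable block."""
    return hashlib.sha256(text.encode("utf-8")).hexdigest()[:16]

def decompose_prompt(template: str, user_history: list[str], items: list[str]):
    """Split a recommendation prompt into independently cacheable blocks.

    Unlike prefix caching, a block can be reused even when it appears at a
    different, non-contiguous position in the next prompt.
    """
    blocks = [("template", template)]
    blocks += [("user", h) for h in user_history]   # compact, replicated tier
    blocks += [("item", i) for i in items]          # large, sharded tier
    return [(kind, text, block_key(text)) for kind, text in blocks]

kv_cache: dict[str, object] = {}  # block key -> precomputed KV tensor (elided)

def split_hits(blocks):
    """Only cache misses pay prefill compute; hits reuse stored KV states."""
    hits = [b for b in blocks if b[2] in kv_cache]
    misses = [b for b in blocks if b[2] not in kv_cache]
    return hits, misses
```

Under a scheme like this, a user-history block keeps its cache entry even when the surrounding prompt changes, which is exactly the non-contiguous reuse that pure prefix caching forfeits.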

Core claim

RcLLM is a distributed inference system that replaces standard prefix KV caching with beyond-prefix caching: prompts are decomposed into reusable blocks, user-history caches are fully replicated for zero-latency retrieval, item caches are sharded by similarity, and an affinity-based global scheduler plus selective attention mechanism together eliminate most redundant quadratic attention computation while preserving output quality.
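The affinity-based scheduler is named but not specified. A minimal sketch of one plausible policy: route each request to the worker that already caches the most of its blocks, with load as a tie-breaker. Both heuristics are assumptions for illustration, not the paper's algorithm.

```python
# Hypothetical affinity-based scheduling policy; block keys are opaque strings.
def schedule(request_blocks: set[str],
             worker_caches: dict[str, set[str]],
             worker_load: dict[str, int]) -> str:
    """Send the request to the worker with the largest cached-block overlap,
    breaking ties toward the least-loaded worker."""
    def score(worker: str) -> tuple[int, int]:
        overlap = len(request_blocks & worker_caches[worker])
        return (overlap, -worker_load[worker])
    return max(worker_caches, key=score)

# A request whose prompt hashes to blocks {a, b, c} lands on w0,
# which already caches two of the three blocks.
caches = {"w0": {"a", "b"}, "w1": {"c"}}
load = {"w0": 3, "w1": 1}
assert schedule({"a", "b", "c"}, caches, load) == "w0"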

What carries the argument

Beyond-prefix KV caching, which decomposes prompts into reusable blocks and supports non-contiguous reuse through stratified storage and selective attention correction.
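One plausible reading of the selective attention correction, sketched in numpy below: reuse the approximate output computed against cached KV states, then recompute exact attention only for the query tokens with the largest estimated error. The error proxy and fixed top-k budget are assumptions; the paper's actual criterion is not given in the abstract.

```python
# Hypothetical realization of selective attention correction.
import numpy as np

def attend(q, k, v):
    """Plain softmax attention: q (n, d), k and v (m, d)."""
    scores = q @ k.T / np.sqrt(q.shape[-1])
    w = np.exp(scores - scores.max(axis=-1, keepdims=True))
    return (w / w.sum(axis=-1, keepdims=True)) @ v

def selective_correction(q, k_approx, k_exact, v, budget=0.1):
    """Keep the approximate output everywhere, then recompute exact attention
    only for the `budget` fraction of query tokens with the largest
    estimated approximation error."""
    out = attend(q, k_approx, v)
    # Crude per-token error proxy: disagreement between key sets.
    err = np.abs(q @ (k_exact - k_approx).T).sum(axis=-1)
    n_fix = max(1, int(budget * len(q)))
    worst = np.argsort(err)[-n_fix:]
    out[worst] = attend(q[worst], k_exact, v)  # exact recompute, only where needed
    return out
```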

If this is right

  • Real-time generative recommendation becomes feasible at industrial scale because first-token latency drops enough for interactive use.
  • Memory and compute costs for serving large item catalogs fall because only relevant blocks are loaded and attention is pruned selectively.
  • The same decomposition approach can be applied to any workload whose prompts contain repeated non-contiguous segments.
  • Distributed serving systems gain a new caching layer that sits between pure prefix reuse and full recomputation.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the authors make directly.

  • The method may extend naturally to multi-turn conversational recommendation where later turns share history blocks with earlier ones.
  • Cache hit-rate measurements on catalogs of varying sizes would quantify how the sharding strategy scales beyond the reported datasets.
  • If block boundaries are chosen by learned embeddings rather than fixed rules, the approach could adapt to new prompt styles without manual tuning.

Load-bearing premise

Prompts can be reliably broken into reusable blocks, and the selective attention correction will repair any approximation errors without meaningfully harming recommendation quality.

What would settle it

Run the same recommendation task on a dataset where block decomposition produces frequent mismatches; if TTFT gains disappear or accuracy falls below the reported negligible threshold, the central claim is falsified.

Figures

Figures reproduced from arXiv: 2605.07443 by Amelie Chi Zhou, Yuxin Wang, Zhan Zhao.

Figure 1. Prompt analysis for generative recommendation. (a) An example prompt combining user interaction history, candidate …
Figure 2. Standard Autoregressive Inference Process.
Figure 3. Analysis of token characteristics and attention mechanisms. (a) Visualizing token embeddings from 1-star and 5-star …
Figure 5. Item popularity distribution of three datasets.
Figure 6. TTFT CDF comparison in a distributed setting with …
Figure 7. Difference between the NDCG of RcLLM and that of Full-Recompute (the higher the better). RcLLM maintains ranking …
Figure 8. Normalized performance (speedup) of RcLLM compared to Prefix-Cache under different cluster sizes for Qwen3-8B (top) …
Figure 10. The impact of scheduling policy on latency under …
Figure 11. The latency cost of increased fidelity …
Original abstract

Large Language Models (LLMs) are transforming recommendation from ranking into a generative task, but industrial deployment remains limited by the high latency of processing long, personalized prompts. Standard prefix caching provides limited benefit because reuse in recommendation workloads is often non-contiguous across user histories and item contexts. We present RcLLM, a distributed inference system for generative recommendation with Beyond-Prefix KV Caching. RcLLM decomposes prompts into reusable blocks and supports large item catalogs with a stratified distributed storage design: compact user-history caches are replicated for zero-latency retrieval, while massive item caches are sharded using similarity-aware placement. To reduce redundant quadratic attention computation, RcLLM combines an affinity-based global scheduler that improves data locality with a selective attention mechanism that corrects approximation errors. Experiments on real-world datasets show that RcLLM reduces Time-To-First-Token (TTFT) by 1.31x-9.51x compared with state-of-the-art prefix caching systems, enabling real-time serving with negligible impact on recommendation accuracy.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

3 major / 1 minor

Summary. The manuscript introduces RcLLM, a distributed inference system for generative recommendation that uses beyond-prefix KV caching. Prompts are decomposed into reusable blocks with stratified storage (replicated user-history caches and sharded item caches), an affinity-based global scheduler for data locality, and a selective attention mechanism to correct approximation errors from non-contiguous reuse. Experiments on real-world datasets are reported to yield 1.31x-9.51x TTFT reductions versus state-of-the-art prefix caching systems while maintaining recommendation accuracy.

Significance. If the performance and accuracy claims are substantiated, the work would address a key barrier to industrial deployment of generative LLMs in recommendation by enabling real-time serving of long personalized prompts where standard prefix caching provides limited benefit due to non-contiguous reuse patterns.

major comments (3)
  1. Abstract: The central claim of 1.31x-9.51x TTFT reduction with 'negligible impact on recommendation accuracy' is presented without any reference to experimental setup details, baselines, datasets, error bars, or statistical tests, leaving the load-bearing performance result with insufficient verifiable support.
  2. Selective attention mechanism (described in abstract): The assertion that this mechanism reliably corrects approximation errors arising from prompt decomposition into reusable blocks lacks any ablation studies, error analysis, or bounds on when correction succeeds, particularly for long heterogeneous user histories; this directly underpins the 'negligible accuracy impact' claim.
  3. Stratified distributed storage design (abstract): No quantitative evaluation or comparison is referenced for the similarity-aware sharding of massive item caches versus alternatives, making it impossible to assess whether this design is load-bearing for the reported TTFT gains.
minor comments (1)
  1. Abstract: The range 1.31x-9.51x is stated without specifying the conditions or datasets under which the minimum and maximum are achieved.

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for the constructive and detailed review. We address each major comment below and outline revisions to strengthen the manuscript.

Point-by-point responses
  1. Referee: Abstract: The central claim of 1.31x-9.51x TTFT reduction with 'negligible impact on recommendation accuracy' is presented without any reference to experimental setup details, baselines, datasets, error bars, or statistical tests, leaving the load-bearing performance result with insufficient verifiable support.

    Authors: We agree the abstract's brevity limits immediate verifiability. The full manuscript (Section 5) specifies real-world datasets, state-of-the-art prefix caching baselines, and averaged TTFT results across runs. We will revise the abstract to briefly reference the datasets, baselines, and note that improvements are consistent with low variance across multiple trials, while retaining conciseness. revision: partial

  2. Referee: Selective attention mechanism (described in abstract): The assertion that this mechanism reliably corrects approximation errors arising from prompt decomposition into reusable blocks lacks any ablation studies, error analysis, or bounds on when correction succeeds, particularly for long heterogeneous user histories; this directly underpins the 'negligible accuracy impact' claim.

    Authors: We acknowledge the current version lacks dedicated ablations and bounds. The manuscript describes the mechanism's design for correcting non-contiguous reuse errors, but we will add a new subsection with ablations isolating selective attention's accuracy impact over varying history lengths and heterogeneity. We will also include error analysis quantifying pre- and post-correction approximation errors and empirical bounds from experiments to substantiate the negligible accuracy claim. revision: yes

  3. Referee: Stratified distributed storage design (abstract): No quantitative evaluation or comparison is referenced for the similarity-aware sharding of massive item caches versus alternatives, making it impossible to assess whether this design is load-bearing for the reported TTFT gains.

    Authors: The stratified design (replicated user caches, similarity-aware sharded item caches) is evaluated end-to-end in the manuscript, but we agree specific comparisons are needed. We will add quantitative results comparing similarity-aware sharding against random and hash-based alternatives, showing effects on load balance, hit rates, and TTFT to demonstrate its contribution to the gains. revision: yes
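A minimal sketch of the comparison promised here, assuming item blocks carry embeddings: similarity-aware placement via plain k-means against a hash-based baseline. Both functions are illustrative stand-ins, not RcLLM's placement algorithm.

```python
# Hypothetical shard-placement comparison for item KV caches.
import numpy as np

def hash_shard(item_ids, n_shards):
    """Baseline: uniform spread that ignores which items co-occur in prompts.
    (A real system would use a stable hash, not Python's salted one.)"""
    return {i: hash(i) % n_shards for i in item_ids}

def similarity_shard(item_ids, embeddings, n_shards, iters=20, seed=0):
    """Co-locate similar items, which are more likely to be co-requested,
    using plain k-means so the example stays self-contained."""
    rng = np.random.default_rng(seed)
    centers = embeddings[rng.choice(len(embeddings), n_shards, replace=False)]
    for _ in range(iters):
        dists = ((embeddings[:, None, :] - centers[None]) ** 2).sum(-1)
        assign = dists.argmin(axis=1)
        for c in range(n_shards):
            members = embeddings[assign == c]
            if len(members):
                centers[c] = members.mean(axis=0)
    return {i: int(c) for i, c in zip(item_ids, assign)}
```

The proposed ablation would then measure load balance, hit rate, and TTFT under each placement; a similarity-aware map should raise the chance that all item blocks of one request live on a single shard.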

Circularity Check

0 steps flagged

No circularity: empirical systems claims rest on experiments

Full rationale

The paper describes a distributed KV-caching system for generative recommendation with no mathematical derivation chain, no fitted parameters renamed as predictions, and no load-bearing self-citations or uniqueness theorems. TTFT reductions and accuracy claims are presented as outcomes of reported experiments on real-world datasets rather than reductions to inputs by construction. Design elements such as block decomposition, affinity scheduling, and selective attention are introduced as engineering choices whose correctness is asserted via empirical measurement, not self-definition or prior-author ansatz smuggling.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Abstract contains no explicit free parameters, mathematical axioms, or invented physical entities; the contribution is a systems design rather than a theoretical derivation.

pith-pipeline@v0.9.0 · 5481 in / 1168 out tokens · 40646 ms · 2026-05-11T02:03:55.907892+00:00 · methodology

