{"total":10,"items":[{"citing_arxiv_id":"2606.13097","ref_index":12,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"Functional Cache Grafting: Robust and Rapid Code-Policy Synthesis for Embodied Agents","primary_cat":"cs.PL","submitted_at":"2026-06-11T09:25:27+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"FCGraft synthesizes code policies for embodied agents by grafting KV caches from a library of validated functions, claiming 18.31% higher success rate and 2.3x faster synthesis than prompt-level caching.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2606.04302","ref_index":18,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"LazyAttention: Efficient Retrieval-Augmented Generation with Deferred Positional Encoding","primary_cat":"cs.CL","submitted_at":"2026-06-03T00:12:22+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":7.0,"formal_verification":"none","one_line_summary":"LazyAttention kernelizes deferred positional encoding to enable zero-copy, position-agnostic KV cache reuse, delivering 1.37× lower TTFT and 1.40× higher throughput than Block-Attention under skewed document distributions while preserving output quality.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.27494","ref_index":3,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"Grounded Cache Routing for Retrieval-Augmented Generation: When Is It Safe to Reuse an Answer?","primary_cat":"cs.CR","submitted_at":"2026-05-26T16:50:02+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"GroundedCache reduces unsafe-served rate in RAG answer caching to 0-1.5% (vs 15-51.5% naive) via four validation gates while keeping p50 latency 
within 1.07x of no-cache baseline.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.20630","ref_index":8,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"Evaluating Temporal Semantic Caching and Workflow Optimization in Agentic Plan-Execute Pipelines","primary_cat":"cs.AI","submitted_at":"2026-05-20T02:30:07+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":5.0,"formal_verification":"none","one_line_summary":"Temporal semantic caching and MCP workflow optimizations deliver 30.6x median speedup on cache hits and 1.67x overall speedup with 40% latency reduction on the AssetOpsBench industrial agent benchmark.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2602.09725","ref_index":39,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"Efficient Remote KV Cache Reuse with GPU-native Video Codec","primary_cat":"cs.DC","submitted_at":"2026-02-10T12:29:02+00:00","verdict":"CONDITIONAL","verdict_confidence":"LOW","novelty_score":7.0,"formal_verification":"none","one_line_summary":"KVCodec uses GPU-native video codecs and pipelined fetching to compress and transmit KV caches, delivering up to 3.51x faster TTFT than prior methods while preserving accuracy.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2510.17934","ref_index":17,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"AtlasKV: Augmenting LLMs with Billion-Scale Knowledge Graphs in 20GB VRAM","primary_cat":"cs.CL","submitted_at":"2025-10-20T15:40:14+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"AtlasKV integrates billion-scale KGs into LLMs parametrically with sub-linear complexity and low memory by converting triples into key-value representations 
handled by the model's attention.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2510.10129","ref_index":23,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"CacheClip: Accelerating RAG with Effective KV Cache Reuse","primary_cat":"cs.LG","submitted_at":"2025-10-11T09:28:26+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"CacheClip accelerates RAG prefill by up to 3.33x via auxiliary-model-guided selective KV recomputation while retaining 85-91% of full-attention quality on NIAH and LongBench.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2504.15965","ref_index":131,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"From Human Memory to AI Memory: A Survey on Memory Mechanisms in the Era of LLMs","primary_cat":"cs.IR","submitted_at":"2025-04-22T15:05:04+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":5.0,"formal_verification":"none","one_line_summary":"The paper surveys human memory categories, maps them to LLM memory, and proposes a new three-dimension (object, form, time) categorization into eight quadrants to organize existing work and highlight open problems.","context_count":1,"top_context_role":"method","top_context_polarity":"background","context_text":"FastServe [112], StreamingLLM [113], Orca [114], DistServe [115], LLM.int8() [116], FastGen [117], Train Large, Then Compress [118], Scissorhands [119], H2O [120], Mooncake [121], MemServe [122], SLM Serving [123], IMPRESS [124], AdaServe [125], MPIC [126], IntelLLM [127] KV Reuse KV Cache [128], Prompt Cache [83], Contextual Retrieval [84], CacheGen [129], ChunkAttention [130], RAGCache [131], SGLang [132], Ada-KV [133], HCache [134], Cake [135], EPIC [136], RelayAttention [137], Marconi [138], IKS [139], FastCache [140], Cache-Craft 
[141], KVLink [142], RAGServe [143], BumbleBee [144] VIII System Parametric Long-Term Parametric Memory Structures Memorizing Transformer [145], Focused Transformer [146], MAC [147], MemoryLLM [148], WISE [149], LongMem [150], LM2 [151], Titans [152]"},{"citing_arxiv_id":"2412.03594","ref_index":14,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"BatchLLM: Optimizing Large Batched LLM Inference with Global Prefix Sharing and Throughput-oriented Token Batching","primary_cat":"cs.CL","submitted_at":"2024-11-29T05:57:37+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"BatchLLM achieves 1.3x-10.8x higher throughput than vLLM and SGLang for batched LLM inference with prefix sharing via global prefix identification, decoding-first reordering, and memory-centric token batching.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2407.13193","ref_index":88,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"Retrieval-Augmented Generation for Natural Language Processing: A Survey","primary_cat":"cs.CL","submitted_at":"2024-07-18T06:06:53+00:00","verdict":"ACCEPT","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"The survey organizes RAG methods via a taxonomy of query-based, logits-based, latent, and parametric fusion with comparisons on accessibility, efficiency, applications, and challenges.","context_count":1,"top_context_role":"background","top_context_polarity":"background","context_text":"all candidates for the final top-k nearest neighbors. Retrieving values from datastore fetches the corresponding values based on the nearest key identifiers. 
5 Retrieval fusions in RAG Query-based Fusions Logits-based Fusions Latent Fusions Text Concatenation Feature Concatenation REALM [53] RAG [95] REINA [161] RALM [133] FID [72] RETROPROMPT [16] LUMEN [27] Ensemble Calibration kNN-LM [88] kNN-MT [87] kNN-Adapter [68] Robust-kNN-MT [78] Source-Context [97] Attention Weighted Addition RETRO [11] Enc-Dec [101] LONGMEM [163] EAE [40] ReFusion [169] Figure 3: The categories of fusion methods in RAG. Algorithm 4.1 Query-based Fusions. Input: A query input 𝑞, top-𝑘 nearest neighbor knowledge {𝑣1, . . . , 𝑣𝑘}, an encoder E𝑓 and a decoder D𝑓 for feature concatenation, the generator"}],"limit":50,"offset":0}