{"total":14,"items":[{"citing_arxiv_id":"2606.24033","ref_index":22,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"RoPE-Aware Bit Allocation for KV-Cache Quantization","primary_cat":"cs.LG","submitted_at":"2026-06-23T00:17:48+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":7.0,"formal_verification":"none","one_line_summary":"Block-GTQ performs RoPE-aware greedy bit allocation on KV caches using per-block energy scores, cutting logit MAE 32-80% versus uniform TQ-MSE and lifting long-context task scores substantially at 2-3 bits per dimension.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2606.19667","ref_index":27,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"CacheWeaver: Cache-Aware Evidence Ordering for Efficient Grounded RAG Inference","primary_cat":"cs.CL","submitted_at":"2026-06-18T00:38:46+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":4.0,"formal_verification":"none","one_line_summary":"CacheWeaver is a lightweight scheduling layer that orders evidence to exploit prefix caching, reducing median TTFT by 20-33% across vLLM setups while preserving answer quality.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2606.02057","ref_index":45,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"Waiting at the front door: Continuous monitoring of latency in the host network stack","primary_cat":"cs.NI","submitted_at":"2026-06-01T10:46:55+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"netstacklat is a new low-overhead monitoring tool that records host network stack latency from early kernel processing to application delivery and was tested on 144 HTTP workload variants plus a Cloudflare 
deployment.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2606.01065","ref_index":35,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"Leyline: KV Cache Directives for Agentic Inference","primary_cat":"cs.DC","submitted_at":"2026-05-31T07:13:15+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":7.0,"formal_verification":"none","one_line_summary":"Leyline adds a policy-directed KV cache edit primitive with closed-form RoPE correction for agentic inference, reporting +11.2 pp cache-hit lift and +14.3 pp solve-rate gain.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.30218","ref_index":17,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"MarginGate: Sparse Margin-Triggered Verification for Batch-Invariant LLM Inference","primary_cat":"cs.LG","submitted_at":"2026-05-28T16:50:19+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":5.0,"formal_verification":"none","one_line_summary":"MarginGate triggers verification only on low-margin decode steps to achieve 100% deterministic batch inference at 15-50% of the cost of always-on verification across tested models and datasets.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.26845","ref_index":22,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"Birkhoff Decompositions and Photonic Interconnects Wait! 
Don't Forget the Compute!","primary_cat":"cs.NI","submitted_at":"2026-05-26T10:59:45+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"A greedy max-weight decomposition strategy for MoE all-to-all communication on photonic fabrics improves overlap efficiency and reduces compute overheads compared to BvN by bounding the number of matchings.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.09735","ref_index":27,"ref_count":2,"confidence":0.9,"is_internal_anchor":false,"paper_title":"KV-RM: Regularizing KV-Cache Movement for Static-Graph LLM Serving","primary_cat":"cs.AR","submitted_at":"2026-05-10T20:10:26+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":5.0,"formal_verification":"none","one_line_summary":"KV-RM regularizes KV-cache movement via block paging and coalesced transfers to improve throughput, tail latency, and memory efficiency in static-graph LLM serving without changing the decoder interface.","context_count":1,"top_context_role":"background","top_context_polarity":"background","context_text":"Across the GPU characterization points in Figure 1(a), the static-graph path retains a substantially larger after-idle process-resident footprint than a paged runtime, leaving materially less multiplexing headroom. Under production-trace replay, our static-graph baseline also exhibits burst-time latency spikes and lower replay-window throughput over the same window [27, 29]. Taken together, these observations point to two missing capabilities under static-graph replay: active-set tracking and burst-robust transport. Existing systems usually resolve this tension at one of two boundaries. 
Dynamic runtimes host flexibility inside the runtime and attention path through block-level paging and stepwise scheduling, thereby avoiding worst-case padding"},{"citing_arxiv_id":"2605.00528","ref_index":41,"ref_count":2,"confidence":0.9,"is_internal_anchor":false,"paper_title":"SAGA: Workflow-Atomic Scheduling for AI Agent Inference on GPU Clusters","primary_cat":"cs.DC","submitted_at":"2026-05-01T09:05:28+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":7.0,"formal_verification":"none","one_line_summary":"SAGA introduces workflow-atomic scheduling for compound AI agents, achieving 1.64x lower task completion time and 1.22x better memory utilization than vLLM on a 64-GPU cluster at the cost of 30% lower peak throughput.","context_count":1,"top_context_role":"extension","top_context_polarity":"extend","context_text":"Continuum [30] introduce workflow-aware eviction; Llumnix [59] enables live KV migration and SOLA [26] optimizes SLO attainment, but neither targets workflow-level scheduling. SAGA's distinctive position (Table 11) is to unify these dimensions under the empirical competitive-ratio bound that quantifies the limit of workflow-aware online cache management. Fairness and caching. DRF [21], VTC [55], and Themis [41] address resource fairness; SAGA extends to task-completion fairness. Our WA-LRU achieves 1.31× competitive ratio against Bélády's optimal [4]. 11 Conclusions and Future Work We presented SAGA, a distributed scheduler for multi-step AI agent workloads that treats agent programs as first-class schedulable units. 
By adapting three classical systems principles (workflow scheduling,"},{"citing_arxiv_id":"2604.15261","ref_index":29,"ref_count":3,"confidence":0.9,"is_internal_anchor":false,"paper_title":"RNG: Flat Datacenter Networks at Scale","primary_cat":"cs.NI","submitted_at":"2026-04-16T17:37:04+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":7.0,"formal_verification":"none","one_line_summary":"RNG deploys the first production flat datacenter network using quasi-random graphs, a new distributed routing protocol, and a passive optical cabling shuffle device, achieving fat-tree performance at substantially lower cost.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2604.07857","ref_index":75,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"Networking-Aware Energy Efficiency in Agentic AI Inference: A Survey","primary_cat":"eess.SY","submitted_at":"2026-04-09T06:13:59+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":4.0,"formal_verification":"none","one_line_summary":"The paper surveys energy efficiency strategies for Agentic AI inference by proposing a new accounting framework and taxonomy that spans model simplification, computation control, input optimization, and cross-layer co-design with wireless networks.","context_count":1,"top_context_role":"background","top_context_polarity":"background","context_text":"[78] Reuses KV caches across semantically similar tasks via LSH (Sim- LLM). Accelerates inference in multi- node edge deployments. KV cache reuse is contingent on high semantic similarity between tasks. [135] Retains only \"Heavy Hitter\" tokens in memory (H2O). Reduces footprint by 5×; enables longer sequence generation. Static ratios fail on unconventional text distributions. [75] Asymmetric quantization: 2-bit value, 4-bit key (KIVI). Reduces memory by 2.6×; supports 64K tokens on mobile. 
May lose precision in complex long-context reasoning. [97] Loads only critical KV cache pages based on query vectors (Quest). >2× speedup in self-attention; lowers I/O energy. Accuracy relies on effective Top-K critical page selection."},{"citing_arxiv_id":"2604.06370","ref_index":37,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"ForkKV: Scaling Multi-LoRA Agent Serving via Copy-on-Write Disaggregated KV Cache","primary_cat":"cs.DC","submitted_at":"2026-04-07T18:52:25+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"ForkKV uses copy-on-write disaggregated KV cache with DualRadixTree and ResidualAttention kernels to deliver up to 3x throughput over prior multi-LoRA serving systems with negligible quality loss.","context_count":1,"top_context_role":"background","top_context_polarity":"background","context_text":"spired by these approaches, ForkKV adapts the CoW mechanism to manage the KV cache for highly-branched shared contexts across agents, effectively extending this paradigm to multi-LoRA agent serving scenarios. KV Cache Optimization. Existing KV cache optimization strategies primarily focus on lossless memory layout improvements [48, 49, 62, 63, 74], lossy compression [37, 69], and cross-chunk or cross-model sharing [19, 36, 67]. In concurrent work, LRAgent [23] proposes to decompose the KV cache into shared and adapter-dependent components for multi-LoRA agent serving with negligible accuracy loss. 
Distinct from these approaches, our work uniquely introduces an OS-inspired DualRadixTree for decoupled cache management and an efficient attention kernel fused with"},{"citing_arxiv_id":"2603.10726","ref_index":44,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"PrefixWall: Mitigating Prefix Caching Side Channels in Shared LLM Systems","primary_cat":"cs.CR","submitted_at":"2026-03-11T12:59:12+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":7.0,"formal_verification":"none","one_line_summary":"PrefixWall mitigates APC side channels in multi-tenant LLM systems via selective prefix isolation, delivering up to 70% higher cache reuse and 30% lower latency than full-isolation baselines.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2511.00868","ref_index":11,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"FlexiCache: Leveraging Temporal Stability of Attention Heads for Efficient KV Cache Management","primary_cat":"cs.LG","submitted_at":"2025-11-02T09:33:12+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"FlexiCache reduces GPU memory for long-context LLM requests by up to 70% and boosts throughput 1.38-1.55x and latency 1.6-2.1x by exploiting per-head differences in temporal stability of critical tokens.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2412.13171","ref_index":15,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"Compressed Chain of Thought: Efficient Reasoning Through Dense Representations","primary_cat":"cs.CL","submitted_at":"2024-12-17T18:50:33+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"CCoT generates variable-length continuous contemplation tokens that compress explicit 
reasoning chains, enabling additional dense reasoning and accuracy gains in off-the-shelf language models while allowing adaptive control of token count.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null}],"limit":50,"offset":0}