{"total":36,"items":[{"citing_arxiv_id":"2606.31519","ref_index":43,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"RaBitQCache: Rotated Binary Quantization for KVCache in Long Context LLM Inference","primary_cat":"cs.LG","submitted_at":"2026-06-30T11:32:14+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":5.0,"formal_verification":"none","one_line_summary":"RaBitQCache proposes rotated binary quantization with binary-INT4 arithmetic for unbiased attention weight estimation in long-context LLMs, enabling adaptive Top-p retrieval and hardware optimizations.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2606.30389","ref_index":15,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"Predict, Reuse, and Repair: Accelerating Dynamic Sparse Attention for Long-Context LLM Decoding","primary_cat":"cs.LG","submitted_at":"2026-06-29T14:43:25+00:00","verdict":"CONDITIONAL","verdict_confidence":"LOW","novelty_score":5.0,"formal_verification":"none","one_line_summary":"PRR accelerates dynamic sparse attention decoding in long-context LLMs via EMA-based prediction, speculative attention, and FlashAttention repair, achieving up to 40% latency reduction.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2606.28831","ref_index":31,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"HARD-KV: Head-Adaptive Regularization for Decoding-time KV Compression","primary_cat":"cs.LG","submitted_at":"2026-06-27T09:36:37+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":5.0,"formal_verification":"none","one_line_summary":"HARD-KV bridges dynamic head-adaptive KV cache compression with static inference engine constraints via Cascade Cache and Logits Calibration, reporting up to 2x throughput gains on long-context math 
benchmarks.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2606.06302","ref_index":41,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"Tangram: Unlocking Non-Uniform KV Cache for Efficient Multi-turn LLM Serving","primary_cat":"cs.LG","submitted_at":"2026-06-04T15:41:27+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":7.0,"formal_verification":"none","one_line_summary":"Tangram makes non-uniform KV cache compression practical for LLM serving with deterministic budget allocation, head group paging, and ahead-of-time load balancing, achieving up to 2.6x throughput gains.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2606.01927","ref_index":51,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"Scaling LLM Inference Beyond Amdahl`s Limits via Eliminating Non-Scalable Overheads","primary_cat":"cs.DC","submitted_at":"2026-06-01T08:58:23+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"Albireo overlaps non-scalable overheads with compute in tensor-parallel LLM inference to raise the empirical optimal TP degree, delivering up to 1.9x throughput and 48% lower latency versus vLLM.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2606.01502","ref_index":35,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"Move the Query, Not the Cache: Characterizing Cross-Instance Latent Attention Redistribution Across GPU Fabrics","primary_cat":"cs.DC","submitted_at":"2026-05-31T23:53:24+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":7.0,"formal_verification":"none","one_line_summary":"On a real multi-node H100 cluster the authors show that for MLA, routing the ~1 KB compressed query row is cheaper than 
moving cache chunks and supply a topology-aware cost model accurate to ~7% on IBGDA fabrics.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.22416","ref_index":18,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"Asymmetric Virtual Memory Paging for Hybrid Mamba-Transformer Inference","primary_cat":"cs.LG","submitted_at":"2026-05-21T12:37:34+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":7.0,"formal_verification":"none","one_line_summary":"AVMP separates KV and SSM cache pools behind unified virtual addressing with failure-triggered migration, cutting OOM events 7.6% and raising throughput 1.83-13.3x on synthetic loads and 2.36x on ShareGPT traces.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.18071","ref_index":38,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"KVDrive: A Holistic Multi-Tier KV Cache Management System for Long-Context LLM Inference","primary_cat":"cs.CL","submitted_at":"2026-05-18T08:54:16+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"KVDrive introduces a multi-tier KV cache management system that achieves up to 1.74x higher throughput for long-context LLM inference through adaptive cache placement, pipeline restructuring, and cross-tier coordination while preserving accuracy.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.17633","ref_index":32,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"SparseSAM: Structured Sparsification of Activations in Segment Anything Models","primary_cat":"cs.CV","submitted_at":"2026-05-17T19:54:22+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"SparseSAM 
achieves 2x faster inference and 2.8x memory reduction in SAM with only 0.004 mIoU loss at 0.4 density via Stripe-Sort Attention and Residual-Consistency MLP.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.18856","ref_index":27,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"SPHERICAL KV: Angle-Domain Attention and Rate-Distortion Retention for Efficient Long-Context Inference","primary_cat":"cs.LG","submitted_at":"2026-05-13T18:48:48+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"Spherical KV combines angle-domain attention using spherical key codes with rate-distortion retention to cut KV cache residency and HBM traffic while keeping a paged, fusion-friendly decode path.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.10195","ref_index":65,"ref_count":2,"confidence":0.9,"is_internal_anchor":true,"paper_title":"Breaking the Reward Barrier: Accelerating Tree-of-Thought Reasoning via Speculative Exploration","primary_cat":"cs.LG","submitted_at":"2026-05-11T08:45:17+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"SPEX delivers 1.2-3x speedup on ToT algorithms via speculative path selection, dynamic budget allocation, and adaptive early termination, reaching up to 4.1x when combined with token-level speculative decoding.","context_count":1,"top_context_role":"method","top_context_polarity":"use_method","context_text":"vLLM [27] introduces PageAttention, which decouples key-value caching from attention computation. SGLang [71] implements RadixAttention to reuse intermediate KV-cache memory across requests. FlashAttention and FlashDecoding utilize the online softmax to reduce memory usage [15]. 
FlashInfer integrates the above optimizations of attention into a unified block-sparse framework [65]. Together, these efforts demonstrate the effectiveness of system-level optimization for general LLM decoding. On the algorithmic side, efficient reasoning methods primarily target token efficiency in CoT. Approaches reduce the number of generated tokens or adaptively adjust reasoning length based on task complexity [7, 9, 17, 35, 36, 56, 70]."},{"citing_arxiv_id":"2605.08524","ref_index":13,"ref_count":1,"confidence":0.9,"is_internal_anchor":true,"paper_title":"Unleashing Scalable Context Parallelism for Foundation Models Pre-Training via FCP","primary_cat":"cs.DC","submitted_at":"2026-05-08T22:16:30+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":5.0,"formal_verification":"none","one_line_summary":"FCP shards sequences at block level with flexible P2P communication and bin-packing to achieve near-linear scaling up to 256 GPUs and 1.13x-2.21x higher attention MFU in foundation model pre-training.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.08467","ref_index":26,"ref_count":1,"confidence":0.9,"is_internal_anchor":true,"paper_title":"CUDAHercules: Benchmarking Hardware-Aware Expert-level CUDA Optimization for LLMs","primary_cat":"cs.LG","submitted_at":"2026-05-08T20:35:27+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":7.0,"formal_verification":"none","one_line_summary":"CUDAHercules benchmark demonstrates that leading LLMs generate functional CUDA code but fail to recover expert-level optimization strategies needed for peak performance on Ampere, Hopper, and Blackwell GPUs.","context_count":1,"top_context_role":"background","top_context_polarity":"background","context_text":"Zheng, Pengfei Guan, Paul Erhart, Jian Sun, Wengen Ouyang, Yanjing Su, and Zheyong Fan. 
Gpumd 4.0: A high-performance molecular dynamics package for versatile materials simulations with machine-learned potentials. Materials Genome Engineering Advances, 3(3): e70028, 2025. doi: https://doi.org/10.1002/mgea.70028. URL https://onlinelibrary.wiley.com/doi/abs/10.1002/mgea.70028. 11 [26] Zihao Ye, Lequn Chen, Ruihang Lai, Wuwei Lin, Yineng Zhang, Stephanie Wang, Tianqi Chen, Baris Kasikci, Vinod Grover, Arvind Krishnamurthy, and Luis Ceze. Flashinfer: Efficient and customizable attention engine for llm inference serving, 2025. URL https://arxiv.org/abs/2501.01005. [27] Ted Zadouri, Markus Hoehnerbach, Jay Shah, Timmy Liu, Vijay Thakkar, and Tri Dao."},{"citing_arxiv_id":"2605.07985","ref_index":40,"ref_count":2,"confidence":0.98,"is_internal_anchor":true,"paper_title":"Dooly: Configuration-Agnostic, Redundancy-Aware Profiling for LLM Inference Simulation","primary_cat":"cs.DC","submitted_at":"2026-05-08T16:44:47+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"Dooly reduces LLM inference profiling GPU-hours by 56.4% across 12 models while keeping simulation MAPE under 5% for TTFT and 8% for TPOT by making profiling configuration-agnostic and redundancy-aware.","context_count":1,"top_context_role":"background","top_context_polarity":"background","context_text":"5-1m technical report, 2025. URL https://arxiv.org/abs/2501.15383. [39] Zihao Ye, Lequn Chen, Ruihang Lai, Wuwei Lin, Yineng Zhang, Stephanie Wang, Tianqi Chen, Baris Kasikci, Vinod Grover, Arvind Krishnamurthy, and Luis Ceze. Flashinfer: Efficient and customizable attention engine for llm inference serving, 2025. URL https://arxiv.org/abs/2501.01005. [40] Lianmin Zheng, Liangsheng Yin, Zhiqiang Xie, Chuyue Sun, Jeff Huang, Cody Hao Yu, Shiyi Cao, Christos Kozyrakis, Ion Stoica, Joseph E. Gonzalez, Clark Barrett, and Ying Sheng. Sglang: Efficient execution of structured language model programs, 2024. 
URL https://arxiv.org/abs/2312.07104. [41] Yinmin Zhong, Shengyu Liu, Junda Chen, Jianbo Hu, Yibo Zhu, Xuanzhe Liu, Xin Jin, and Hao"},{"citing_arxiv_id":"2605.06068","ref_index":80,"ref_count":1,"confidence":0.9,"is_internal_anchor":true,"paper_title":"VibeServe: Can AI Agents Build Bespoke LLM Serving Systems?","primary_cat":"cs.AI","submitted_at":"2026-05-07T11:54:53+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":8.0,"formal_verification":"none","one_line_summary":"VibeServe demonstrates that AI agents can synthesize bespoke LLM serving systems end-to-end, remaining competitive with vLLM in standard settings while outperforming it in six non-standard scenarios involving unusual models, workloads, or hardware.","context_count":1,"top_context_role":"background","top_context_polarity":"background","context_text":"suggest a different point in the design space for infrastructure software: generation-time specialization rather than runtime generality. Code is available at https://github.com/uw-syfi/vibe-serve. 1 Introduction LLM serving systems are critical software infrastructure for an economy increasingly dependent on generative AI. Open-source stacks such as vLLM [36], SGLang [80], and TensorRT-LLM [52] provide efficient abstractions across a broad range of models and hardware. Yet their designs are shaped primarily by mainstream deployments, such as decoder-only Transformers on NVIDIA GPUs serving generic chat workloads. 
As a result, emerging model families (e.g., multimodal models or hybrid state-space architectures), along with new hardware accelerators and atypical workloads,"},{"citing_arxiv_id":"2605.00616","ref_index":11,"ref_count":1,"confidence":0.9,"is_internal_anchor":true,"paper_title":"LLM-Emu: Native Runtime Emulation of LLM Inference via Profile-Driven Sampling","primary_cat":"cs.DC","submitted_at":"2026-05-01T12:35:21+00:00","verdict":"ACCEPT","verdict_confidence":"LOW","novelty_score":7.0,"formal_verification":"none","one_line_summary":"LLM-Emu is a serving-native emulator for vLLM that replaces GPU execution with profile-driven latency sampling and achieves under 5% error on TPOT, ITL, E2E latency, and throughput across multiple models, GPUs, and workloads.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2604.27476","ref_index":13,"ref_count":2,"confidence":0.98,"is_internal_anchor":true,"paper_title":"EdgeFM: Efficient Edge Inference for Vision-Language Models","primary_cat":"cs.CV","submitted_at":"2026-04-30T06:18:50+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":4.0,"formal_verification":"none","one_line_summary":"EdgeFM is an agent-driven VLM inference framework achieving up to 1.49x speedup over TensorRT-Edge-LLM on NVIDIA Orin and first end-to-end deployment on Horizon Journey platform.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2604.26039","ref_index":7,"ref_count":1,"confidence":0.9,"is_internal_anchor":true,"paper_title":"RaMP: Runtime-Aware Megakernel Polymorphism for Mixture-of-Experts","primary_cat":"cs.LG","submitted_at":"2026-04-28T18:20:12+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"RaMP uses a hardware-derived performance region analysis and a four-parameter wave cost model to select optimal polymorphic kernel 
configurations for MoE inference from runtime expert histograms, delivering 1.22x kernel and 1.30x end-to-end speedups with 0.93% mean regret after brief profiling.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2604.19241","ref_index":46,"ref_count":1,"confidence":0.9,"is_internal_anchor":true,"paper_title":"UniEP: Unified Expert-Parallel MoE MegaKernel for LLM Training","primary_cat":"cs.DC","submitted_at":"2026-04-21T08:49:27+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":5.0,"formal_verification":"none","one_line_summary":"UniEP fuses MoE communication and computation into unified MegaKernels with deterministic token ordering, delivering 1.03x-1.38x speedups over prior work while preserving training accuracy.","context_count":1,"top_context_role":"background","top_context_polarity":"background","context_text":"of production-grade training, specifically addressing the inefficiencies inherent in current separate-kernel execution models. Limitation of state-of-the-art approaches. The community has proposed diverse strategies to mitigate EP bottlenecks, including dynamic load balancing (e.g., MegaBlocks [15]), model compression (DeepSpeed-MoE [32]), heterogeneous offloading (HeterMoE [41]), and communication-computation overlap (MegaScale-MoE [18], COMET [46]). Specialized communication libraries like DeepEP [48] have also been developed to accelerate communication primitives. Production frameworks such as Megatron-LM [23] now integrate these advancements, combining Transformer Engine [26] with optimized kernels [15, 48] to push performance limits. 
Despite these strides, we identify two critical deficiencies in state-of-the-art work."},{"citing_arxiv_id":"2604.18348","ref_index":54,"ref_count":1,"confidence":0.9,"is_internal_anchor":true,"paper_title":"AdaCluster: Adaptive Query-Key Clustering for Sparse Attention in Video Generation","primary_cat":"cs.CV","submitted_at":"2026-04-20T14:43:36+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"AdaCluster delivers a training-free adaptive query-key clustering framework for sparse attention in video DiTs, yielding 1.67-4.31x inference speedup with negligible quality loss on CogVideoX-2B, HunyuanVideo, and Wan-2.1.","context_count":1,"top_context_role":"method","top_context_polarity":"use_method","context_text":"For the key tokens, we show that the distribution of the key tokens varies greatly across different layers. Due to this, we propose an layerwise adaptive kmeans clustering method, covering cluster number assignment, threshold-wise adaptive clustering, and efficient critical cluster selection. 3. Evaluation. We build custom operators for AdaCluster on Triton [36] and FlashInfer [54], and test it on open-source DiT models including CogVideoX-2B [52], HunyuanVideo [16], and Wan-2.1 [39]. Experimental results on one A40 GPU demonstrate that AdaCluster achieves an end-to-end acceleration of 1.67×-4.31× for resolutions above 720p within our test range, while maintaining high visual fidelity with PSNR of 30.99. 2. 
Background and Motivation"},{"citing_arxiv_id":"2604.14825","ref_index":47,"ref_count":1,"confidence":0.9,"is_internal_anchor":true,"paper_title":"Nautilus: An Auto-Scheduling Tensor Compiler for Efficient Tiled GPU Kernels","primary_cat":"cs.PL","submitted_at":"2026-04-16T09:55:23+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":7.0,"formal_verification":"none","one_line_summary":"Nautilus auto-compiles math-like tensor descriptions into optimized GPU kernels, delivering up to 42% higher throughput than prior compilers on transformer models across NVIDIA GPUs.","context_count":1,"top_context_role":"background","top_context_polarity":"background","context_text":"3 (Dao-AILab's CUTLASS implementation) [12], and FlashInfer 0.6.7 [45]. Each baseline supports a subset of the setups we evaluate. For PyTorch and FlexAttn, we use compile(mode='max-autotune', backend='inductor') for high performance. For Tawa and TileLang, we use implementations following the FlashAttention-3 strategy. We do not evaluate FlashAttention-4 [47] because it includes approximation techniques that are outside the scope of our exact-attention evaluation. Profiling and Performance Reporting. We use the CUDA Events API to measure the end-to-end inference latency of each model, and we report the throughput in TFLOPs per second. 
For the throughput, following [21], we calculate the theoretical floating-point operations (FLOPs) required"},{"citing_arxiv_id":"2604.14141","ref_index":95,"ref_count":1,"confidence":0.9,"is_internal_anchor":true,"paper_title":"Geometric Context Transformer for Streaming 3D Reconstruction","primary_cat":"cs.CV","submitted_at":"2026-04-15T17:58:13+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"LingBot-Map is a streaming 3D reconstruction model built on a geometric context transformer that combines anchor context, pose-reference window, and trajectory memory to deliver accurate, drift-resistant results at 20 FPS over sequences longer than 10,000 frames.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.00831","ref_index":22,"ref_count":1,"confidence":0.9,"is_internal_anchor":true,"paper_title":"GhostServe: A Lightweight Checkpointing System in the Shadow for Fault-Tolerant LLM Serving","primary_cat":"cs.DC","submitted_at":"2026-03-26T13:27:57+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":7.0,"formal_verification":"none","one_line_summary":"GhostServe applies erasure coding to KV cache in host memory for fast recovery from failures in LLM serving, cutting checkpointing latency up to 2.7x and recovery latency 2.1x versus prior methods.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2603.18636","ref_index":14,"ref_count":1,"confidence":0.9,"is_internal_anchor":true,"paper_title":"Attention Sparsity is Input-Stable: Training-Free Sparse Attention for Video Generation via Offline Sparsity Profiling and Online QK Co-Clustering","primary_cat":"cs.CV","submitted_at":"2026-03-19T09:00:08+00:00","verdict":"CONDITIONAL","verdict_confidence":"LOW","novelty_score":7.0,"formal_verification":"none","one_line_summary":"Attention sparsity in video 
DiTs is an input-stable layer-wise property, enabling offline profiling and online bidirectional QK co-clustering for up to 1.93x speedup with PSNR up to 29 dB.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2604.09558","ref_index":36,"ref_count":1,"confidence":0.9,"is_internal_anchor":true,"paper_title":"VTC: DNN Compilation with Virtual Tensors for Data Movement Elimination","primary_cat":"cs.DC","submitted_at":"2026-02-11T06:23:10+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":7.0,"formal_verification":"none","one_line_summary":"VTC eliminates unnecessary data movement in DNN compilation using virtual tensors tracked by index mappings, achieving up to 1.93x speedup and 60% memory savings on NVIDIA GPUs.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2601.14910","ref_index":75,"ref_count":1,"confidence":0.9,"is_internal_anchor":true,"paper_title":"PipeWeave: Synergizing Analytical and Learning Models for Unified GPU Performance Prediction","primary_cat":"cs.PF","submitted_at":"2026-01-21T11:47:56+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"PipeWeave predicts GPU kernel performance with 6.1% average error and end-to-end inference with 8.5% error by feeding analytical pipeline features into ML, cutting prior method errors by 4-7x across 11 GPUs.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2601.13684","ref_index":14,"ref_count":1,"confidence":0.9,"is_internal_anchor":true,"paper_title":"HeteroCache: A Dynamic Retrieval Approach to Heterogeneous KV Cache Compression for Long-Context LLM 
Inference","primary_cat":"cs.CL","submitted_at":"2026-01-20T07:35:06+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"HeteroCache dynamically allocates KV cache space to attention heads based on their temporal stability and uses hierarchical asynchronous retrieval to achieve state-of-the-art long-context performance with up to 3x faster decoding at 224K context length.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2512.19179","ref_index":31,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"CascadeInfer: Length-Aware Scheduling of LLM Serving with Low Latency and Load Balancing","primary_cat":"cs.DC","submitted_at":"2025-12-22T09:13:40+00:00","verdict":"CONDITIONAL","verdict_confidence":"MODERATE","novelty_score":6.0,"formal_verification":"none","one_line_summary":"CascadeInfer partitions LLM instances into length-specialized groups, uses dynamic programming for stage partitioning, and applies runtime refinement plus decentralized load balancing to cut latency and raise throughput.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2512.12087","ref_index":25,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"BLASST: Dynamic BLocked Attention Sparsity via Softmax Thresholding","primary_cat":"cs.CL","submitted_at":"2025-12-12T23:30:43+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"BLASST dynamically sparsifies attention by thresholding softmax scores to skip blocks, delivering 1.5x speedups at 70%+ sparsity while preserving benchmark 
accuracy.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2511.02043","ref_index":23,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"Flashlight: PyTorch Compiler Extensions to Accelerate Attention Variants","primary_cat":"cs.LG","submitted_at":"2025-11-03T20:25:19+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"Flashlight is a compiler-native PyTorch framework that generates efficient fused kernels for arbitrary and data-dependent attention variants, supporting more cases than FlexAttention with competitive performance.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2510.18245","ref_index":48,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"Scaling Laws Meet Model Architecture: Toward Inference-Efficient LLMs","primary_cat":"cs.LG","submitted_at":"2025-10-21T03:08:48+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"A conditional scaling law fitted on over 200 models from 80M to 3B parameters identifies architectures that deliver up to 2.1% higher accuracy and 42% higher inference throughput than LLaMA-3.2 under the same training budget.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2510.09883","ref_index":25,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"DELTA: Dynamic Layer-Aware Token Attention for Efficient Long-Context Reasoning","primary_cat":"cs.CL","submitted_at":"2025-10-10T21:37:49+00:00","verdict":"CONDITIONAL","verdict_confidence":"MODERATE","novelty_score":7.0,"formal_verification":"none","one_line_summary":"DELTA partitions layers into full, delta, and sparse groups to select salient tokens via aggregated attention scores, matching 
full-attention accuracy on AIME and GPQA while cutting attended tokens up to 4.25x and achieving 1.54x speedup.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2510.08726","ref_index":41,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"Neptune: Advanced ML Operator Fusion for Locality and Parallelism on GPUs","primary_cat":"cs.PL","submitted_at":"2025-10-09T18:33:52+00:00","verdict":"CONDITIONAL","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"Neptune introduces dependency-breaking fusion with algebraic corrections for reduction sequences, generating FlashAttention-like kernels from plain attention code with 1.35x average speedup across ten benchmarks and four GPU architectures.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2505.02922","ref_index":115,"ref_count":1,"confidence":0.9,"is_internal_anchor":true,"paper_title":"RetroInfer: A Vector Storage Engine for Scalable Long-Context LLM Inference","primary_cat":"cs.LG","submitted_at":"2025-05-05T18:01:17+00:00","verdict":"CONDITIONAL","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"RetroInfer introduces the wave index and wave buffer to realize sparse KV-cache attention for long-context LLM inference with up to 4.4X throughput gains while matching full-attention accuracy.","context_count":1,"top_context_role":"baseline","top_context_polarity":"baseline","context_text":"tance, rather than relying on fixed positions in the context. Challenges.The fundamental trade-off between accuracy and retrieval cost in ANNS persists and becomes more pronounced when applied to sparsity-based KV caching. Existing vector indexes are inadequate to address the accuracy challenge due to the high variability in attention sparsity (Figure 4). 
As in prior work [115], the retrieval cost remains substantial relative to the limited PCIe bandwidth, because it must account for this variability to retrieve more tokens for desired accuracy. Moreover, efficiency challenges arise when using a vector index for sparsity-based KV cache, as it introduces index traversal and selective data access into an inference system optimized for dense"},{"citing_arxiv_id":"2502.01068","ref_index":33,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"FastKV: Decoupling of Context Reduction and KV Cache Compression for Prefill-Decoding Acceleration","primary_cat":"cs.LG","submitted_at":"2025-02-03T05:25:09+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":7.0,"formal_verification":"none","one_line_summary":"FastKV decouples prefill context reduction via Token-Selective Propagation from independent KV cache selection, delivering up to 1.82x prefill and 2.87x decoding speedups while matching decoding-only accuracy.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2412.03594","ref_index":41,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"BatchLLM: Optimizing Large Batched LLM Inference with Global Prefix Sharing and Throughput-oriented Token Batching","primary_cat":"cs.CL","submitted_at":"2024-11-29T05:57:37+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"BatchLLM achieves 1.3x-10.8x higher throughput than vLLM and SGLang for batched LLM inference with prefix sharing via global prefix identification, decoding-first reordering, and memory-centric token batching.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null}],"limit":50,"offset":0}