{"total":25,"items":[{"citing_arxiv_id":"2606.29207","ref_index":21,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"KernelFlume: Elastic Core-Attention Scaling for Agentic Long-Context Decoding","primary_cat":"cs.DC","submitted_at":"2026-06-28T05:16:17+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":5.0,"formal_verification":"none","one_line_summary":"KernelFlume presents a disaggregated decode architecture that separates core attention from projection/FFN paths to enable elastic scaling of attention nodes, reporting up to 61% lower cost per million tokens versus full-instance scaling on H100 hardware for Llama-3.1-8B under dynamic long-context w","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.25954","ref_index":31,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"Step-TP: A Grounded, Step-Level Dataset with Chain-of-Thought Reasoning for LLM-Guided Tensor Program Optimization","primary_cat":"cs.LG","submitted_at":"2026-05-25T15:29:49+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":7.0,"formal_verification":"none","one_line_summary":"Step-TP is a dataset providing grounded, atomic step-level IR transitions and CoT supervision to enable reliable multi-step LLM-guided tensor program optimization instead of end-to-end imitation.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2604.25306","ref_index":17,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"QFlash: Bridging Quantization and Memory Efficiency in Vision Transformer Attention","primary_cat":"cs.LG","submitted_at":"2026-04-28T07:13:26+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":7.0,"formal_verification":"none","one_line_summary":"QFlash implements end-to-end integer FlashAttention with integer-only softmax, delivering up to 8.69x 
speedup and 18.8% energy savings on ViT models while preserving accuracy under per-tensor quantization.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2604.24715","ref_index":33,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"Long-Context Aware Upcycling: A New Frontier for Hybrid LLM Scaling","primary_cat":"cs.CL","submitted_at":"2026-04-27T17:23:37+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"HyLo upcycles Transformer LLMs into hybrids with MLA and Mamba2/Gated DeltaNet blocks via staged training and distillation, extending context to 2M tokens and outperforming prior upcycled hybrids on long-context benchmarks.","context_count":1,"top_context_role":"background","top_context_polarity":"background","context_text":"3 Ablation Studies Comparison with position interpolation.To reduce the computational cost of model upcycling, we train models with shorter context lengths and then apply zero-shot context length extension. Specifically, we train models at different sequence lengths while keeping the total training token budget constant, and then apply YaRN position interpolation [33] to the RoPE embeddings in MLA layers to extend the context length. Mamba layers, which do not use positional embeddings, remain unchanged. In Figure 3, we evaluate performance on both short- and long-context tasks. Applying YaRN slightly reduces short-context accuracy but significantly improves long-context performance. 
For example, the 1B-4MLA-12M2 model trained with an 8K context achieves 50."},{"citing_arxiv_id":"2604.24820","ref_index":36,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"Salca: A Sparsity-Aware Hardware Accelerator for Efficient Long-Context Attention Decoding","primary_cat":"cs.AR","submitted_at":"2026-04-27T14:06:21+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"Salca is a new ASIC accelerator that achieves 3.82× speedup and 74.19× energy efficiency over A100 for long-context attention via dual-compression dynamic sparse attention and pipelined hardware.","context_count":1,"top_context_role":"background","top_context_polarity":"background","context_text":"(adopted by Energon[72] and Sanger[31]) as an example, data load- ing volume of filter stage is2 × that of attention when sparsity is 95%. This results in severe power costs. At a sequence length of 8K, filter stage already accounts for more than 60% of total power, yield- ing only 25% power savings compared with dense attention[ 50]. Second, sparse pattern selection introduces significant latency[52] [36] [72]. Identifying sparsity relies on extracting Top-K elements from approximation score, which has𝑂(𝑛log𝑘) complexity[33]. In SCS, 𝑛 and 𝑘 are small, so time overhead is not significant. How- ever, in LCS, 𝑛 and 𝑘 increase substantially. Sorting performance deteriorates sharply, becoming critical path of end-to-end latency. 
Third, data supply becomes bottleneck."},{"citing_arxiv_id":"2604.23798","ref_index":27,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"ELSA: Exact Linear-Scan Attention for Fast and Memory-Light Vision Transformers","primary_cat":"cs.LG","submitted_at":"2026-04-26T16:41:30+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"ELSA casts online softmax attention as a prefix scan over monoid (m,S,W) to deliver exact FP32 semantics, O(n) memory, O(log n) depth, and Tensor-Core independence as a drop-in kernel.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2604.23466","ref_index":32,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"Evaluating CUDA Tile for AI Workloads on Hopper and Blackwell GPUs","primary_cat":"cs.LG","submitted_at":"2026-04-25T23:13:47+00:00","verdict":null,"verdict_confidence":null,"novelty_score":null,"formal_verification":null,"one_line_summary":null,"context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2604.16957","ref_index":9,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"Open-TQ-Metal: Fused Compressed-Domain Attention for Long-Context LLM Inference on Apple Silicon","primary_cat":"cs.LG","submitted_at":"2026-04-18T10:39:28+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"Fused compressed-domain int4 attention on Apple Silicon delivers 48x speedup and 3.2x KV cache compression for 128K-context 70B models while matching FP16 token predictions.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2604.15944","ref_index":16,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"CIMple: Standard-cell SRAM-based CIM with 
LUT-based split softmax for attention acceleration","primary_cat":"cs.AR","submitted_at":"2026-04-17T11:03:52+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":5.0,"formal_verification":"none","one_line_summary":"CIMple delivers a 32 kb digital SRAM-based compute-in-memory accelerator for transformer self-attention that reaches 26.1 TOPS/W at 0.85 V in 28 nm with INT8 precision using dual-banked architecture and LUT-based split softmax.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2604.18616","ref_index":44,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"ARGUS: Agentic GPU Optimization Guided by Data-Flow Invariants","primary_cat":"cs.DC","submitted_at":"2026-04-16T15:49:31+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"Argus generates GPU kernels achieving 99-104% of hand-optimized throughput on key LLM kernels by enforcing compile-time data-flow invariants via a tag-based DSL and an in-context RL planner.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2604.12798","ref_index":17,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"VFA: Relieving Vector Operations in Flash Attention with Global Maximum Pre-computation","primary_cat":"cs.LG","submitted_at":"2026-04-14T14:28:50+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":5.0,"formal_verification":"none","one_line_summary":"VFA optimizes Flash Attention by pre-computing global max approximations from key blocks and reordering traversal to reduce vector bottlenecks while preserving exact computation.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2604.03446","ref_index":53,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"Fast 
Cross-Operator Optimization of Attention Dataflow","primary_cat":"cs.AR","submitted_at":"2026-04-03T20:37:58+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":7.0,"formal_verification":"none","one_line_summary":"MMEE encodes dataflow decisions in matrix form for fast exhaustive search, delivering 40-69% lower latency and energy use than prior methods while running 64-343x faster.","context_count":1,"top_context_role":"background","top_context_polarity":"background","context_text":"taining partial sums (psums) cannot be forwarded from the producer to the consumer. In the valid dataflow (left), the partial sum ofc1produced at stage 1⃝cannot be consumed at stage 3⃝; only the fully accumulated tile after stage 2⃝ is allowed. However, the right dataflow violates this con- straint by consumingc1at stage 2⃝. This restriction follows FlashAttention [19], [53], where each intermediate tile must be fully accumulated before the online softmax. By enforcing this constraint, we can systematically generate any valid at- tention fusion dataflow without violating producer-consumer dependencies, while ensuring functional correctness. 
Recomputation.Recomputation enables more flexible dataflows and shorter reuse distances, which in turn reduces"},{"citing_arxiv_id":"2604.03425","ref_index":49,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"AEGIS: Scaling Long-Sequence Homomorphic Encrypted Transformer Inference via Hybrid Parallelism on Multi-GPU Systems","primary_cat":"cs.CR","submitted_at":"2026-04-03T19:47:26+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"AEGIS reduces inter-GPU communication by up to 81.3% in self-attention and reaches 96.62% scaling efficiency with 3.86x speedup on four GPUs for 2048-token encrypted Transformer inference.","context_count":1,"top_context_role":"background","top_context_polarity":"background","context_text":"08-1.48× for short inputs but both fail beyond 512 tokens, and HEBooster remains below1×throughout. 6 Related Works Ciphertext Packing and Application Level Optimizations. Packing schemes have been extensively studied across diverse HE- based encrypted inference workloads, including CNNs [ 4, 5, 34], GNNs [33, 53, 54], and recent works on Transformers [49, 64]. These approaches optimize encrypted Transformer inference at the ap- plication level, including polynomial approximation for nonlinear functions and algorithmic improvements to encrypted matrix mul- tiplication. 
These techniques are orthogonal to AEGIS , which in- stead targets cross-operator multi-GPU orchestration under cipher- text/RNS coupling."},{"citing_arxiv_id":"2603.15031","ref_index":31,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"Attention Residuals","primary_cat":"cs.CL","submitted_at":"2026-03-16T09:32:21+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":5.0,"formal_verification":"none","one_line_summary":"Attention Residuals replaces fixed residual summation with input-dependent softmax attention over preceding layers, and a blocked variant is shown to improve uniformity and downstream performance in a 48B-parameter model pre-trained on 1.4T tokens.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2602.22575","ref_index":19,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"S2O: Early Stopping for Sparse Attention via Online Permutation","primary_cat":"cs.LG","submitted_at":"2026-02-26T03:30:28+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"S2O uses online permutation and importance-based early stopping to increase effective sparsity in attention, delivering 7.51x attention and 3.81x end-to-end speedups on Llama-3.1-8B at 128K context with preserved accuracy.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2602.18196","ref_index":23,"ref_count":2,"confidence":0.98,"is_internal_anchor":true,"paper_title":"RAT+: Train Dense, Infer Sparse -- Recurrence Augmented Attention for Dilated 
Inference","primary_cat":"cs.LG","submitted_at":"2026-02-20T13:09:49+00:00","verdict":null,"verdict_confidence":null,"novelty_score":null,"formal_verification":null,"one_line_summary":null,"context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2602.03067","ref_index":37,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"FlashSinkhorn: IO-Aware Entropic Optimal Transport on GPU","primary_cat":"cs.LG","submitted_at":"2026-02-03T03:52:20+00:00","verdict":"CONDITIONAL","verdict_confidence":"MODERATE","novelty_score":7.0,"formal_verification":"none","one_line_summary":"FlashSinkhorn delivers up to 32x forward and 161x end-to-end speedups for entropic OT on A100 GPUs via IO-aware Triton kernels that fuse log-domain updates and streaming transport application.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2602.01219","ref_index":11,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"Mixture-of-Top-k Attention: Efficient Attention via Scalable Fast Weights","primary_cat":"cs.LG","submitted_at":"2026-02-01T13:21:53+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":7.0,"formal_verification":"none","one_line_summary":"MiTA makes attention scalable by gathering query-aware top-k key-value pairs through landmarks as deformable routed experts and compressing the N-width fast-weight MLP into a shared narrower expert.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2511.02043","ref_index":13,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"Flashlight: PyTorch Compiler Extensions to Accelerate Attention Variants","primary_cat":"cs.LG","submitted_at":"2025-11-03T20:25:19+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"Flashlight is a 
compiler-native PyTorch framework that generates efficient fused kernels for arbitrary and data-dependent attention variants, supporting more cases than FlexAttention with competitive performance.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2509.09682","ref_index":24,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"Faster and Memory-Efficient Training of Sequential Recommendation Models for Large Catalogs","primary_cat":"cs.IR","submitted_at":"2025-08-13T15:03:38+00:00","verdict":"ACCEPT","verdict_confidence":"MODERATE","novelty_score":6.0,"formal_verification":"none","one_line_summary":"CCE- is a Triton kernel implementation of cross-entropy loss with negative sampling that reduces memory by more than 10x and accelerates training by up to 2x for large-catalog sequential recommenders.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2505.23884","ref_index":17,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"Test-Time Training Done Right","primary_cat":"cs.LG","submitted_at":"2025-05-29T17:50:34+00:00","verdict":"CONDITIONAL","verdict_confidence":"MODERATE","novelty_score":6.0,"formal_verification":"none","one_line_summary":"Large-chunk online updates during inference let test-time training scale state capacity to 40% of model size and handle contexts up to 1M tokens without custom kernels.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2412.03594","ref_index":22,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"BatchLLM: Optimizing Large Batched LLM Inference with Global Prefix Sharing and Throughput-oriented Token 
Batching","primary_cat":"cs.CL","submitted_at":"2024-11-29T05:57:37+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"BatchLLM achieves 1.3x-10.8x higher throughput than vLLM and SGLang for batched LLM inference with prefix sharing via global prefix identification, decoding-first reordering, and memory-centric token batching.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2310.01889","ref_index":25,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"Ring Attention with Blockwise Transformers for Near-Infinite Context","primary_cat":"cs.CL","submitted_at":"2023-10-03T08:44:50+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":7.0,"formal_verification":"none","one_line_summary":"Ring Attention uses blockwise computation and ring communication to let Transformers process sequences up to device-count times longer than prior memory-efficient methods.","context_count":1,"top_context_role":"background","top_context_polarity":"background","context_text":"books and high-resolution images to analyzing long videos and complex codebases. They excel at extracting information from the interconnected web and hyperlinked content, and are crucial for handling complex scientific experiment data. There have been emerging use cases of language models with significantly expanded context than before: GPT-3.5 [32] with context length 16K, GPT-4 [29] with context length 32k, MosaicML's MPT [25] with context length 65k, and Anthropic's Claude [1] with context length 100k. Driven by the significance, there has been surging research interests in reducing memory cost. 
One line of research leverages the observation that the softmax matrix in self-attention can be computed without materializing the full matrix [24] which has led to the development of blockwise"},{"citing_arxiv_id":"2307.08691","ref_index":11,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"FlashAttention-2: Faster Attention with Better Parallelism and Work Partitioning","primary_cat":"cs.LG","submitted_at":"2023-07-17T17:50:36+00:00","verdict":"ACCEPT","verdict_confidence":"MODERATE","novelty_score":6.0,"formal_verification":"none","one_line_summary":"FlashAttention-2 achieves roughly 2x speedup over FlashAttention by parallelizing attention across thread blocks and distributing work within blocks, reaching 50-73% of theoretical peak FLOPs/s on A100 GPUs.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2205.14135","ref_index":60,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"FlashAttention: Fast and Memory-Efficient Exact Attention with IO-Awareness","primary_cat":"cs.LG","submitted_at":"2022-05-27T17:53:09+00:00","verdict":"ACCEPT","verdict_confidence":"MODERATE","novelty_score":7.0,"formal_verification":"none","one_line_summary":"FlashAttention reduces GPU high-bandwidth memory accesses in self-attention via tiling, delivering exact attention with lower IO complexity, 2-3x wall-clock speedups on models like GPT-2, and the ability to train on sequences up to 64K long.","context_count":1,"top_context_role":"background","top_context_polarity":"background","context_text":"[58] Peter Mattson, Christine Cheng, Gregory Diamos, Cody Coleman, Paulius Micikevicius, David Patterson, Hanlin Tang, Gu-Yeon Wei, Peter Bailis, Victor Bittorf, et al. Mlperf training benchmark.Proceedings of Machine Learning and Systems, 2:336-349, 2020. [59] Frank McSherry, Michael Isard, and Derek G Murray. Scalability! but at whatfCOSTg? 
In 15th Workshop on Hot Topics in Operating Systems (HotOS XV), 2015. [60] Maxim Milakov and Natalia Gimelshein. Online normalizer calculation for softmax. arXiv preprint arXiv:1805.02867, 2018. [61] NVIDIA. Nvidia Tesla V100 GPU architecture, 2017. [62] NVIDIA. Nvidia A100 tensor core GPU architecture, 2020. [63] NVIDIA. Nvidia H100 tensor core GPU architecture, 2022. [64] D Stott Parker. Random butterfly transformations with applications in computational linear algebra."}],"limit":50,"offset":0}