Online normalizer calculation for softmax

Maxim Milakov, Natalia Gimelshein · 2018 · cs.PF · arXiv 1805.02867

Canonical reference. 100% of citing Pith papers cite this work as background.

31 Pith papers citing it

abstract

The Softmax function is ubiquitous in machine learning, and multiple previous works have suggested faster alternatives for it. In this paper we propose a way to compute classical Softmax with fewer memory accesses and hypothesize that this reduction in memory accesses should improve Softmax performance on actual hardware. The benchmarks confirm this hypothesis: Softmax accelerates by up to 1.3x and Softmax+TopK combined and fused by up to 5x.
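
The abstract does not spell out the method, so the following is only a minimal Python sketch of the idea the title refers to: keeping a running maximum and a running normalizer in a single pass, rescaling the normalizer whenever the maximum changes. The function names `safe_softmax` and `online_softmax` are illustrative choices, not names taken from the paper, and the scalar loop stands in for what would be a GPU kernel in practice.

```python
import math

def safe_softmax(xs):
    """Reference version: three passes over the input (max, sum, output)."""
    m = max(xs)                                   # pass 1: find the maximum
    d = sum(math.exp(x - m) for x in xs)          # pass 2: compute the normalizer
    return [math.exp(x - m) / d for x in xs]      # pass 3: write the outputs

def online_softmax(xs):
    """Sketch of the online normalizer: max and normalizer in one pass."""
    m = float("-inf")   # running maximum
    d = 0.0             # running normalizer, always relative to the current m
    for x in xs:
        m_new = max(m, x)
        # Rescale the old normalizer to the new maximum, then add this term.
        d = d * math.exp(m - m_new) + math.exp(x - m_new)
        m = m_new
    # Second and final pass: write the outputs.
    return [math.exp(x - m) / d for x in xs]

if __name__ == "__main__":
    print(online_softmax([1.0, 2.0, 3.0]))
```

Under this sketch the input is read twice instead of three times, which is the kind of memory-access reduction the abstract describes. Both functions should agree up to floating-point rounding, including for large inputs such as `[1000.0, 1001.0]` where a naive `exp(x)` would overflow.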

citation-role summary

background 6

citation-polarity summary

background 6

representative citing papers

StreamKL: Fast and Memory-Efficient KL Divergence for Boosting Attention Distillation

cs.LG · 2026-06-18 · unverdicted · novelty 8.0

StreamKL is the first fused GPU primitive for attention KL divergence that reduces memory from O(N_Q N_K) to O(1) via an online one-pass formulation and tile-wise recomputation.

Mesh Inference: A Formal Model of Collective Inference Without a Center

cs.MA · 2026-06-17 · unverdicted · novelty 8.0

Mesh inference allows a network of agents to reach the centralized optimum through local relaxations of a coupled free energy using only admitted observations, with convergence guaranteed by M-matrix properties in the linear-Gaussian regime.

Move the Query, Not the Cache: Characterizing Cross-Instance Latent Attention Redistribution Across GPU Fabrics

cs.DC · 2026-05-31 · unverdicted · novelty 7.0

On a real multi-node H100 cluster the authors show that for MLA, routing the ~1 KB compressed query row is cheaper than moving cache chunks and supply a topology-aware cost model accurate to ~7% on IBGDA fabrics.

Step-TP: A Grounded, Step-Level Dataset with Chain-of-Thought Reasoning for LLM-Guided Tensor Program Optimization

cs.LG · 2026-05-25 · unverdicted · novelty 7.0

Step-TP is a dataset providing grounded, atomic step-level IR transitions and CoT supervision to enable reliable multi-step LLM-guided tensor program optimization instead of end-to-end imitation.

FlashSinkhorn: IO-Aware Entropic Optimal Transport on GPU

cs.LG · 2026-02-03 · conditional · novelty 7.0

FlashSinkhorn delivers up to 32x forward and 161x end-to-end speedups for entropic OT on A100 GPUs via IO-aware Triton kernels that fuse log-domain updates and streaming transport application.

Mixture-of-Top-k Attention: Efficient Attention via Scalable Fast Weights

cs.LG · 2026-02-01 · unverdicted · novelty 7.0

MiTA makes attention scalable by gathering query-aware top-k key-value pairs through landmarks as deformable routed experts and compressing the N-width fast-weight MLP into a shared narrower expert.

QFlash: Bridging Quantization and Memory Efficiency in Vision Transformer Attention

cs.LG · 2026-04-28 · unverdicted · novelty 7.0

QFlash implements end-to-end integer FlashAttention with integer-only softmax, delivering up to 8.69x speedup and 18.8% energy savings on ViT models while preserving accuracy under per-tensor quantization.

Fast Cross-Operator Optimization of Attention Dataflow

cs.AR · 2026-04-03 · unverdicted · novelty 7.0

MMEE encodes dataflow decisions in matrix form for fast exhaustive search, delivering 40-69% lower latency and energy use than prior methods while running 64-343x faster.

Ring Attention with Blockwise Transformers for Near-Infinite Context

cs.CL · 2023-10-03 · unverdicted · novelty 7.0

Ring Attention uses blockwise computation and ring communication to let Transformers process sequences up to device-count times longer than prior memory-efficient methods.

FlashAttention: Fast and Memory-Efficient Exact Attention with IO-Awareness

cs.LG · 2022-05-27 · accept · novelty 7.0

FlashAttention reduces GPU high-bandwidth memory accesses in self-attention via tiling, delivering exact attention with lower IO complexity, 2-3x wall-clock speedups on models like GPT-2, and the ability to train on sequences up to 64K long.

Scalable Physics-Inspired Transformers for Spin Glasses

cond-mat.dis-nn · 2026-06-22 · unverdicted · novelty 6.0

A physics-inspired transformer with sparse attention and FlashAttention enables up to 100x faster sampling of large spin-glass systems, providing distributions, free energies, and overlaps for SK and EA models where prior ML methods fail at some temperatures.

KForge: LLM-Driven Cross-Platform Kernel Generation for AI Accelerators

cs.LG · 2026-06-01 · unverdicted · novelty 6.0

KForge uses dual LLM agents for cross-platform kernel generation, reporting 2.12% throughput gain on NVIDIA B200 vs TensorRT-LLM and 5.13x geometric mean speedup on Intel Arc B580 vs PyTorch on 37 workloads.

S2O: Early Stopping for Sparse Attention via Online Permutation

cs.LG · 2026-02-26 · unverdicted · novelty 6.0

S2O uses online permutation and importance-based early stopping to increase effective sparsity in attention, delivering 7.51x attention and 3.81x end-to-end speedups on Llama-3.1-8B at 128K context with preserved accuracy.

Flashlight: PyTorch Compiler Extensions to Accelerate Attention Variants

cs.LG · 2025-11-03 · unverdicted · novelty 6.0

Flashlight is a compiler-native PyTorch framework that generates efficient fused kernels for arbitrary and data-dependent attention variants, supporting more cases than FlexAttention with competitive performance.

Faster and Memory-Efficient Training of Sequential Recommendation Models for Large Catalogs

cs.IR · 2025-08-13 · accept · novelty 6.0

CCE- is a Triton kernel implementation of cross-entropy loss with negative sampling that reduces memory by more than 10x and accelerates training by up to 2x for large-catalog sequential recommenders.

Test-Time Training Done Right

cs.LG · 2025-05-29 · conditional · novelty 6.0

Large-chunk online updates during inference let test-time training scale state capacity to 40% of model size and handle contexts up to 1M tokens without custom kernels.

BatchLLM: Optimizing Large Batched LLM Inference with Global Prefix Sharing and Throughput-oriented Token Batching

cs.CL · 2024-11-29 · unverdicted · novelty 6.0

BatchLLM achieves 1.3x-10.8x higher throughput than vLLM and SGLang for batched LLM inference with prefix sharing via global prefix identification, decoding-first reordering, and memory-centric token batching.

Long-Context Aware Upcycling: A New Frontier for Hybrid LLM Scaling

cs.CL · 2026-04-27 · unverdicted · novelty 6.0

HyLo upcycles Transformer LLMs into hybrids with MLA and Mamba2/Gated DeltaNet blocks via staged training and distillation, extending context to 2M tokens and outperforming prior upcycled hybrids on long-context benchmarks.

Salca: A Sparsity-Aware Hardware Accelerator for Efficient Long-Context Attention Decoding

cs.AR · 2026-04-27 · unverdicted · novelty 6.0

Salca is a new ASIC accelerator that achieves 3.82× speedup and 74.19× energy efficiency over A100 for long-context attention via dual-compression dynamic sparse attention and pipelined hardware.

ELSA: Exact Linear-Scan Attention for Fast and Memory-Light Vision Transformers

cs.LG · 2026-04-26 · unverdicted · novelty 6.0

ELSA casts online softmax attention as a prefix scan over monoid (m,S,W) to deliver exact FP32 semantics, O(n) memory, O(log n) depth, and Tensor-Core independence as a drop-in kernel.

Evaluating CUDA Tile for AI Workloads on Hopper and Blackwell GPUs

cs.LG · 2026-04-25 · conditional · novelty 6.0

CuTile achieves up to 2.5x FlashAttention-2 throughput on B200 with 60 lines of Python but shows significant cross-architecture portability gaps, reaching only 53% of FlashAttention-2 on RTX PRO 6000.

Open-TQ-Metal: Fused Compressed-Domain Attention for Long-Context LLM Inference on Apple Silicon

cs.LG · 2026-04-18 · unverdicted · novelty 6.0

Fused compressed-domain int4 attention on Apple Silicon delivers 48x speedup and 3.2x KV cache compression for 128K-context 70B models while matching FP16 token predictions.

ARGUS: Agentic GPU Optimization Guided by Data-Flow Invariants

cs.DC · 2026-04-16 · unverdicted · novelty 6.0

Argus generates GPU kernels achieving 99-104% of hand-optimized throughput on key LLM kernels by enforcing compile-time data-flow invariants via a tag-based DSL and an in-context RL planner.

AEGIS: Scaling Long-Sequence Homomorphic Encrypted Transformer Inference via Hybrid Parallelism on Multi-GPU Systems

cs.CR · 2026-04-03 · unverdicted · novelty 6.0

AEGIS reduces inter-GPU communication by up to 81.3% in self-attention and reaches 96.62% scaling efficiency with 3.86x speedup on four GPUs for 2048-token encrypted Transformer inference.
