Online normalizer calculation for softmax

Canonical reference. 100% of citing Pith papers cite this work as background.

31 Pith papers citing it
Background 100% of classified citations
abstract

The Softmax function is ubiquitous in machine learning, and multiple previous works have suggested faster alternatives for it. In this paper we propose a way to compute classical Softmax with fewer memory accesses and hypothesize that this reduction in memory accesses should improve Softmax performance on actual hardware. The benchmarks confirm this hypothesis: Softmax accelerates by up to 1.3x, and Softmax+TopK combined and fused by up to 5x.
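
The abstract does not spell out the method, so the following is only a minimal sketch of the idea the title points to: the running maximum and the normalizer are updated together in a single pass over the input, instead of one pass for the maximum and another for the sum. The function name `online_softmax` is hypothetical, inputs are assumed to be finite floats, and this pure-Python version illustrates the recurrence rather than the memory-access savings measured in the paper.

```python
import math


def online_softmax(xs):
    """Numerically stable softmax with max and normalizer found in one pass.

    Illustrative sketch only; assumes every element of xs is finite.
    """
    m = float("-inf")  # running maximum seen so far
    d = 0.0            # running normalizer, expressed relative to m

    for x in xs:
        m_new = max(m, x)
        # Rescale the old sum to the new maximum, then add the new term.
        d = d * math.exp(m - m_new) + math.exp(x - m_new)
        m = m_new

    # Second pass: produce the outputs using the final max and normalizer.
    return [math.exp(x - m) / d for x in xs]


if __name__ == "__main__":
    print(online_softmax([1.0, 2.0, 3.0]))
```

Compared with the usual three-pass safe softmax (maximum, sum, output), this version reads the input twice, which is the kind of reduction in memory accesses the abstract refers to.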

citation-role summary

background 6

representative citing papers

Mesh Inference: A Formal Model of Collective Inference Without a Center

cs.MA · 2026-06-17 · unverdicted · novelty 8.0

Mesh inference allows a network of agents to reach the centralized optimum through local relaxations of a coupled free energy using only admitted observations, with convergence guaranteed by M-matrix properties in the linear-Gaussian regime.

FlashSinkhorn: IO-Aware Entropic Optimal Transport on GPU

cs.LG · 2026-02-03 · conditional · novelty 7.0

FlashSinkhorn delivers up to 32x forward and 161x end-to-end speedups for entropic OT on A100 GPUs via IO-aware Triton kernels that fuse log-domain updates and streaming transport application.

Fast Cross-Operator Optimization of Attention Dataflow

cs.AR · 2026-04-03 · unverdicted · novelty 7.0

MMEE encodes dataflow decisions in matrix form for fast exhaustive search, delivering 40-69% lower latency and energy use than prior methods while running 64-343x faster.

Scalable Physics-Inspired Transformers for Spin Glasses

cond-mat.dis-nn · 2026-06-22 · unverdicted · novelty 6.0

A physics-inspired transformer with sparse attention and FlashAttention enables up to 100x faster sampling of large spin-glass systems, providing distributions, free energies, and overlaps for SK and EA models where prior ML methods fail at some temperatures.

S2O: Early Stopping for Sparse Attention via Online Permutation

cs.LG · 2026-02-26 · unverdicted · novelty 6.0

S2O uses online permutation and importance-based early stopping to increase effective sparsity in attention, delivering 7.51x attention and 3.81x end-to-end speedups on Llama-3.1-8B at 128K context with preserved accuracy.

Test-Time Training Done Right

cs.LG · 2025-05-29 · conditional · novelty 6.0

Large-chunk online updates during inference let test-time training scale state capacity to 40% of model size and handle contexts up to 1M tokens without custom kernels.

Long-Context Aware Upcycling: A New Frontier for Hybrid LLM Scaling

cs.CL · 2026-04-27 · unverdicted · novelty 6.0

HyLo upcycles Transformer LLMs into hybrids with MLA and Mamba2/Gated DeltaNet blocks via staged training and distillation, extending context to 2M tokens and outperforming prior upcycled hybrids on long-context benchmarks.
