FlashSampling: Fast and Memory-Efficient Exact Sampling
Pith reviewed 2026-05-15 09:57 UTC · model grok-4.3
The pith
FlashSampling fuses exact categorical sampling into the LM-head matrix multiply so the full logits tensor is never written to HBM.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
FlashSampling is an exact sampling primitive that fuses sampling into the LM-head matmul and never materializes the logits tensor in HBM. The method computes logits tile-by-tile on chip, adds Gumbel noise, keeps only one maximizer per row and per vocabulary tile, and finishes with a small reduction over tiles. In tensor-parallel decoding, it replaces the all-gather of logits with streaming peer-to-peer writes that overlap GPU-to-GPU communication with computation and HBM loads. The kernel remains exact because argmax decomposes over partitions, and grouped variants preserve exact categorical sampling by hierarchical factorization.
What carries the argument
Tile-wise on-chip argmax after Gumbel noise, followed by a lightweight cross-tile reduction, fuses sampling into the LM-head epilogue.
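The mechanism can be illustrated with a minimal NumPy sketch (not the paper's CUDA kernel; the shapes, tile width, and host-side loop are illustrative): each vocabulary tile's logits get Gumbel noise, only one (value, index) pair per row survives each tile, and a cross-tile reduction recovers the same token as a full-logits Gumbel-max sampler.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy stand-ins for the LM head: hidden states and a large vocabulary.
hidden = rng.standard_normal((4, 64)).astype(np.float32)      # [batch, d_model]
w_head = rng.standard_normal((64, 32000)).astype(np.float32)  # [d_model, vocab]
tile = 4000                                                    # vocabulary tile width

# One shared Gumbel noise draw so both paths sample the same token.
gumbel = rng.gumbel(size=(4, 32000)).astype(np.float32)

# Reference path: materialize all logits, then Gumbel-max sampling.
full_logits = hidden @ w_head
ref_tokens = np.argmax(full_logits + gumbel, axis=1)

# Tiled path: per-tile matmul + noise + argmax, keeping one (value, index)
# pair per row per tile, then a small cross-tile reduction.
# The full logits tensor is never stored.
best_val = np.full(4, -np.inf, dtype=np.float32)
best_idx = np.zeros(4, dtype=np.int64)
for start in range(0, 32000, tile):
    t_logits = hidden @ w_head[:, start:start + tile] + gumbel[:, start:start + tile]
    t_idx = np.argmax(t_logits, axis=1)
    t_val = np.take_along_axis(t_logits, t_idx[:, None], axis=1).ravel()
    update = t_val > best_val
    best_val = np.where(update, t_val, best_val)
    best_idx = np.where(update, t_idx + start, best_idx)

assert np.array_equal(best_idx, ref_tokens)
```

Because each logit column's dot product is computed independently, the tiled pass produces the same values as the full matmul here, and the cross-tile max therefore selects the same token.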
If this is right
- Kernel-level speedups appear on decode workloads across H100, H200, B200 and B300 GPUs.
- End-to-end time per output token drops by up to 10 percent in vLLM experiments.
- Tensor-parallel scaling remains near-ideal at large batch sizes because communication overlaps computation across up to 8 GPUs.
- The bandwidth-bound sampling step is absorbed into the matmul epilogue without approximation.
- Grouped hierarchical sampling stays exactly equivalent to the original categorical distribution.
Where Pith is reading between the lines
- Memory savings from avoiding the full logits tensor could allow larger batch sizes or longer contexts on the same hardware.
- The same tile-wise maximizer pattern might extend to other post-matmul operations such as top-k or nucleus sampling.
- Peer-to-peer streaming could replace other collective operations in distributed decoding pipelines.
- The technique may become more valuable as vocabulary sizes continue to grow in newer models.
Load-bearing premise
Argmax over vocabulary tiles processed independently on chip equals the global argmax with no numerical discrepancy or edge-case failure.
What would settle it
Run both FlashSampling and a standard full-logits Gumbel sampler on identical inputs and observe whether the output token distributions differ for any model or batch.
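In a toy setting, that experiment reduces to the following NumPy check (an illustrative sketch, not the proposed kernel-level test): with shared Gumbel noise, a tiled sampler and a full-logits sampler must agree draw-for-draw, and the empirical token frequencies of either should match the softmax of the logits.

```python
import numpy as np

rng = np.random.default_rng(1)
logits = rng.standard_normal(8).astype(np.float32)  # tiny vocabulary for a cheap check

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

probs = softmax(logits)

n = 200_000
gumbel = rng.gumbel(size=(n, 8))

# Full-logits Gumbel-max sampler.
full_draws = np.argmax(logits + gumbel, axis=1)

# Tiled sampler: argmax within each half of the vocabulary, then a
# cross-tile reduction over the two surviving candidates.
halves = np.stack([np.argmax(logits[:4] + gumbel[:, :4], axis=1),
                   np.argmax(logits[4:] + gumbel[:, 4:], axis=1) + 4])
vals = np.take_along_axis(logits + gumbel, halves.T, axis=1)
tiled_draws = halves.T[np.arange(n), np.argmax(vals, axis=1)]

# With shared noise the two samplers agree draw-for-draw, and the
# empirical distribution matches softmax(logits).
assert np.array_equal(full_draws, tiled_draws)
freq = np.bincount(full_draws, minlength=8) / n
assert np.abs(freq - probs).max() < 0.01
```

The first assertion tests the exactness claim directly; the second is the distribution-level comparison the settling experiment describes, at a scale where sampling error is well below 1%.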
original abstract
Sampling from a categorical distribution is mathematically simple, but in large-vocabulary decoding, it often triggers extra memory traffic and extra kernels after the LM head. We present FlashSampling, an exact sampling primitive that fuses sampling into the LM-head matmul and never materializes the logits tensor in HBM. The method is simple: compute logits tile-by-tile on chip, add Gumbel noise, keep only one maximizer per row and per vocabulary tile, and finish with a small reduction over tiles. In tensor-parallel decoding, FlashSampling replaces the all-gather of logits with streaming peer-to-peer writes: This overlaps GPU-to-GPU communication with computation and HBM loads across up to 8 GPUs, with near-ideal scaling at large batch sizes. Our kernel is exact because argmax decomposes over partitions; grouped variants for online and tensor-parallel settings are exact by hierarchical factorization of the categorical distribution. FlashSampling demonstrates kernel-level speedups on decode workloads across 4 different datacenter GPUs (H100, H200, B200, B300), and in end-to-end vLLM experiments, it reduces time per output token by up to 10% on the models we test. These results show that exact sampling, with no approximation, can be integrated into the matmul itself, consolidating the bandwidth-bound sampling step in an efficient epilogue.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The manuscript proposes FlashSampling as a method to perform exact categorical sampling by fusing it into the language model head's matrix multiplication. This avoids writing the full logits tensor to high-bandwidth memory by processing vocabulary tiles on-chip, adding Gumbel noise locally, retaining only the argmax per tile, and reducing across tiles. For tensor-parallel decoding, all-gather is replaced by peer-to-peer streaming writes. Exactness is argued via the algebraic property that the global argmax decomposes over partitions, with hierarchical factorization for grouped and online variants. Empirical results include kernel-level speedups on H100, H200, B200, and B300 GPUs, plus up to 10% reduction in time-per-token in vLLM end-to-end tests.
Significance. If the numerical stability concerns are resolved, this approach could meaningfully accelerate memory-bound sampling steps in large-vocabulary LLM decoding without sacrificing correctness. The fusion into the matmul epilogue and the communication-computation overlap in tensor-parallel settings represent practical engineering advances. The work provides concrete measurements across multiple GPU architectures, which strengthens its applicability claims.
major comments (1)
- [Abstract and method description] The central exactness claim depends on argmax decomposition holding in the target low-precision arithmetic. However, independent per-tile matmul accumulations in BF16/FP16 followed by on-chip Gumbel addition and max reduction may produce a different index than a materialized FP32 computation due to rounding variations; the manuscript provides no mismatch rate measurements or stability analysis to confirm bit-exact equivalence.
minor comments (2)
- [Experimental evaluation] Speedup results across GPU generations would benefit from reporting variance or multiple trials, as GPU kernel performance can vary.
- [Abstract] The specific model sizes, vocabulary sizes, and workload parameters (batch size, sequence length) for the vLLM experiments should be explicitly stated to allow replication of the 10% improvement.
Simulated Author's Rebuttal
We thank the referee for their careful reading and constructive comments on the manuscript. The primary concern raised regarding numerical stability and exactness in low-precision arithmetic is addressed below. We agree that additional empirical validation would strengthen the claims and will incorporate the requested analysis in the revised version.
point-by-point responses
-
Referee: [Abstract and method description] The central exactness claim depends on argmax decomposition holding in the target low-precision arithmetic. However, independent per-tile matmul accumulations in BF16/FP16 followed by on-chip Gumbel addition and max reduction may produce a different index than a materialized FP32 computation due to rounding variations; the manuscript provides no mismatch rate measurements or stability analysis to confirm bit-exact equivalence.
Authors: We thank the referee for highlighting this important subtlety. FlashSampling's exactness claim is with respect to the low-precision (BF16/FP16) arithmetic used in the standard LM-head matmul, not a hypothetical FP32 materialization. The algebraic decomposition of argmax over partitions holds in any arithmetic where the max operation is well-defined, including the floating-point operations performed on-chip. To directly address potential discrepancies arising from rounding variations, we will add a dedicated stability analysis section to the revised manuscript. This will include mismatch rate measurements comparing FlashSampling outputs to a reference low-precision (BF16/FP16) implementation that materializes the full logits tensor, as well as comparisons against FP32 where relevant. These measurements will be performed on representative models and input distributions to quantify any differences.
revision: yes
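A mismatch-rate harness of the kind the rebuttal promises could look roughly like this NumPy sketch. It is hypothetical and simplified: low precision is emulated only by rounding the materialized logits to float16 (as a stand-in for BF16) before the Gumbel-max epilogue, so it does not capture per-tile accumulation-order effects on real hardware.

```python
import numpy as np

rng = np.random.default_rng(2)
batch, d, vocab = 256, 128, 1024
hidden = rng.standard_normal((batch, d)).astype(np.float32)
w = rng.standard_normal((d, vocab)).astype(np.float32)
gumbel = rng.gumbel(size=(batch, vocab)).astype(np.float32)

# FP32 reference: full-precision logits, then Gumbel-max.
ref = np.argmax(hidden @ w + gumbel, axis=1)

# Emulated low-precision path: round the logits before adding noise,
# as a fused epilogue operating on reduced-precision values would see them.
lo_logits = (hidden @ w).astype(np.float16).astype(np.float32)
lo = np.argmax(lo_logits + gumbel, axis=1)

mismatch_rate = np.mean(ref != lo)
print(f"argmax mismatch rate vs FP32 reference: {mismatch_rate:.4f}")
```

A mismatch here is not an error in the tiled reduction itself; it quantifies how often rounding alone flips the sampled token, which is the baseline any bit-exactness claim must be stated against.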
Circularity Check
No circularity; exactness follows from standard argmax decomposition over partitions
full rationale
The paper's derivation of exact sampling relies on the algebraic identity that argmax over a union of vocabulary partitions equals the maximum of the per-partition argmaxes (and the hierarchical extension for grouped variants). This identity is a direct mathematical property of the argmax operator and does not reduce to any fitted parameter, self-citation chain, or ansatz introduced by the authors. No equation or claim in the provided text equates a derived quantity to its own inputs by construction, and the kernel implementation simply applies this identity to avoid materializing the full logits tensor. The result is self-contained against external benchmarks with no load-bearing circular steps.
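The hierarchical factorization invoked here can be checked numerically with a small NumPy sketch (illustrative; the group size and vocabulary are arbitrary): factoring the categorical distribution as P(token) = P(group) · P(token | group), with group logits given by the log-sum-exp of each group's token logits, reproduces the original softmax exactly.

```python
import numpy as np

rng = np.random.default_rng(3)
logits = rng.standard_normal(12)
groups = logits.reshape(3, 4)  # 3 groups of 4 tokens each

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

# Direct categorical distribution over all 12 tokens.
direct = softmax(logits)

# Hierarchical factorization: group logits are per-group log-sum-exps.
m = groups.max()
group_logits = np.log(np.exp(groups - m).sum(axis=1)) + m
p_group = softmax(group_logits)
p_token_given_group = np.stack([softmax(g) for g in groups])

# P(token) = P(group) * P(token | group), flattened back to 12 tokens.
hierarchical = (p_group[:, None] * p_token_given_group).ravel()

assert np.allclose(direct, hierarchical)
```

Because sampling a group and then a token within it follows exactly these two conditionals, the two-stage (grouped) sampler draws from the same categorical distribution as the direct one, which is the exactness property the rationale relies on.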
Axiom & Free-Parameter Ledger
axioms (1)
- standard math: the argmax over a disjoint union of sets is the per-set argmax from the set whose maximum is largest
Forward citations
Cited by 1 Pith paper
-
Towards a Data-Parameter Correspondence for LLMs: A Preliminary Discussion
A data-parameter correspondence unifies data-centric and parameter-centric LLM optimizations as dual geometric operations on the statistical manifold via Fisher-Rao metric and Legendre duality.