Express Language Modeling

Albert Gong; Annabelle Michael Carrell; Lester Mackey; Raaz Dwivedi

arxiv: 2606.10944 · v1 · pith:EW3FIAUKnew · submitted 2026-06-09 · 💻 cs.LG · cs.DS · math.ST · stat.ME · stat.ML · stat.TH

keywords: approximation, express, attention, causal, compression, decoding, guarantees, language

We introduce a new tool, Express, for converting a non-causal attention approximation into a causal approximation with matching approximation guarantees. When combined with the state-of-the-art Thinformer approximation, Express improves upon the best known causal attention guarantees, delivering $\log^{3/2}(n)/s$ approximation error with only $O(s)$ memory and $O(s^2 \log^2(n))$ compression overhead for a sequence of length $n$. We pair these developments with an efficient I/O-aware Triton implementation, demonstrate substantial speedups over FlashAttention 2, and use Express to overcome four resource bottlenecks in the language modeling pipeline: long-context prefill, KV cache compression, long-form memory-constrained decoding, and long-form compute-constrained decoding.
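
For orientation, the following is a minimal sketch of the exact causal softmax attention that a causal approximation would be measured against, together with a simple worst-case error metric. It is not Express or Thinformer, and it does not reproduce the paper's guarantees; the function names `causal_attention` and `max_abs_error`, the $1/\sqrt{d}$ scaling, and the max-absolute-entry error are illustrative assumptions. The sketch assumes queries, keys, and values are given as lists of equal-length row vectors.

```python
import math


def causal_attention(Q, K, V):
    """Exact causal softmax attention on lists of row vectors.

    Output row i attends only to keys and values at positions 0..i.
    (Illustrative baseline; assumed 1/sqrt(d) score scaling.)
    """
    d = len(Q[0])
    out = []
    for i, q in enumerate(Q):
        # Scaled dot products against the causal prefix only.
        scores = [
            sum(a * b for a, b in zip(q, K[j])) / math.sqrt(d)
            for j in range(i + 1)
        ]
        m = max(scores)  # subtract the max for numerical stability
        w = [math.exp(s - m) for s in scores]
        z = sum(w)
        out.append([
            sum(w[j] * V[j][c] for j in range(i + 1)) / z
            for c in range(len(V[0]))
        ])
    return out


def max_abs_error(approx, exact):
    """Largest absolute entrywise gap between two attention outputs."""
    return max(
        abs(a - e)
        for row_a, row_e in zip(approx, exact)
        for a, e in zip(row_a, row_e)
    )
```

This baseline does work quadratic in the sequence length $n$ and keeps all $n$ keys and values in memory, which is the cost that an $O(s)$-memory approximation aims to avoid. Any approximate causal attention routine can be compared against it by passing both outputs to `max_abs_error`.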
